CS236781: Deep Learning on Computational Accelerators¶

Homework Assignment 3¶

Faculty of Computer Science, Technion.

Submitted by:

# Name Id email
Student 1 [Hay Elmaliah] [315777433] [hay.e@campus.technion.ac.il]
Student 2 [Orad Barel] [311288203] [oradbarel@campus.technion.ac.il]

Introduction¶

In this assignment we'll learn to generate text with a deep multilayer RNN network based on GRU cells. Then we'll turn to image generation using a variational autoencoder. Finally, we'll shift our focus to sentiment analysis: first by training a transformer-style encoder, and then by fine-tuning a pre-trained model from Hugging Face.

General Guidelines¶

  • Please read the getting started page on the course website. It explains how to set up, run and submit the assignment.
  • This assignment requires running on GPU-enabled hardware. Please read the course servers usage guide. It explains how to use and run your code on the course servers to benefit from training with GPUs.
  • The text and code cells in these notebooks are intended to guide you through the assignment and help you verify your solutions. The notebooks do not need to be edited, except for filling in your name(s) in the cell above before submission and implementing a small code block in Part 4. Please do not remove sections or change the order of any cells.
  • All your code (and even answers to questions) should be written in the files within the python package corresponding to the assignment number (hw1, hw2, etc.). You can of course use any editor or IDE to work on these files.

Contents¶

  • Part 1: Sequence Models
    • Text generation with a char-level RNN
    • Obtaining the corpus
    • Data Preprocessing
    • Dataset Creation
    • Model Implementation
    • Generating text by sampling
    • Training
    • Generating a work of art
    • Questions
  • Part 2: Variational Autoencoder
    • Obtaining the dataset
    • The Variational Autoencoder
    • Model Implementation
    • Loss Implementation
    • Sampling
    • Training
    • Questions
  • Part 3: Transformer Encoder
    • Reminder: scaled dot product attention
    • Sliding window attention
    • Multihead Sliding window attention
    • Sentiment analysis
    • Obtaining the dataset
    • Tokenizer
    • Transformer Encoder
    • Training
    • Questions
  • Part 4: Fine-tuning a pretrained language model
    • Loading the dataset
    • Tokenizer
    • Loading pre-trained model
    • Fine-tuning
    • Questions

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$

Part 1: Sequence Models¶

In this part we will learn about working with text sequences using recurrent neural networks. We'll go from a raw text file all the way to a fully trained GRU-RNN model and generate works of art!

In [1]:
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re

import numpy as np
import torch
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2
In [2]:
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
Using device: cpu

Text generation with a char-level RNN¶

Obtaining the corpus¶

Let's begin by downloading a corpus containing all the works of William Shakespeare. Since he was very prolific, this corpus is fairly large and will provide us with enough data for obtaining impressive results.

In [3]:
CORPUS_URL = 'https://github.com/cedricdeboom/character-level-rnn-datasets/raw/master/datasets/shakespeare.txt'
DATA_DIR = pathlib.Path.home().joinpath('.pytorch-datasets')

def download_corpus(out_path=DATA_DIR, url=CORPUS_URL, force=False):
    pathlib.Path(out_path).mkdir(exist_ok=True)
    out_filename = os.path.join(out_path, os.path.basename(url))
    
    if os.path.isfile(out_filename) and not force:
        print(f'Corpus file {out_filename} exists, skipping download.')
    else:
        print(f'Downloading {url}...')
        with urllib.request.urlopen(url) as response, open(out_filename, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)
        print(f'Saved to {out_filename}.')
    return out_filename
    
corpus_path = download_corpus()
Corpus file /home/hay.e/.pytorch-datasets/shakespeare.txt exists, skipping download.

Load the text into memory and print a snippet:

In [4]:
with open(corpus_path, 'r', encoding='utf-8') as f:
    corpus = f.read()

print(f'Corpus length: {len(corpus)} chars')
print(corpus[7:1234])
Corpus length: 6347703 chars
ALLS WELL THAT ENDS WELL

by William Shakespeare

Dramatis Personae

  KING OF FRANCE
  THE DUKE OF FLORENCE
  BERTRAM, Count of Rousillon
  LAFEU, an old lord
  PAROLLES, a follower of Bertram
  TWO FRENCH LORDS, serving with Bertram

  STEWARD, Servant to the Countess of Rousillon
  LAVACHE, a clown and Servant to the Countess of Rousillon
  A PAGE, Servant to the Countess of Rousillon

  COUNTESS OF ROUSILLON, mother to Bertram
  HELENA, a gentlewoman protected by the Countess
  A WIDOW OF FLORENCE.
  DIANA, daughter to the Widow

  VIOLENTA, neighbour and friend to the Widow
  MARIANA, neighbour and friend to the Widow

  Lords, Officers, Soldiers, etc., French and Florentine  

SCENE:
Rousillon; Paris; Florence; Marseilles

ACT I. SCENE 1.
Rousillon. The COUNT'S palace

Enter BERTRAM, the COUNTESS OF ROUSILLON, HELENA, and LAFEU, all in black

  COUNTESS. In delivering my son from me, I bury a second husband.
  BERTRAM. And I in going, madam, weep o'er my father's death anew;
    but I must attend his Majesty's command, to whom I am now in
    ward, evermore in subjection.
  LAFEU. You shall find of the King a husband, madam; you, sir, a
    father. He that so generally is at all times good must of
    

Data Preprocessing¶

The first thing we'll need is to map from each unique character in the corpus to an index that will represent it in our learning process.
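For intuition, here is one conventional way to build such maps on a toy string. This is only a hedged sketch; the required `char_maps()` interface may differ.

```python
# Toy sketch of char-index maps: sort the unique chars for a deterministic
# ordering, then enumerate them. Not necessarily the required char_maps().
text = "abba!"
unique_chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(unique_chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}
print(char_to_idx)  # {'!': 0, 'a': 1, 'b': 2}
```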

TODO: Implement the char_maps() function in the hw3/charnn.py module.

In [5]:
import hw3.charnn as charnn

char_to_idx, idx_to_char = charnn.char_maps(corpus)
print(char_to_idx)

test.assertEqual(len(char_to_idx), len(idx_to_char))
test.assertSequenceEqual(list(char_to_idx.keys()), list(idx_to_char.values()))
test.assertSequenceEqual(list(char_to_idx.values()), list(idx_to_char.keys()))
{'\n': 0, ' ': 1, '!': 2, '"': 3, '$': 4, '&': 5, "'": 6, '(': 7, ')': 8, ',': 9, '-': 10, '.': 11, '0': 12, '1': 13, '2': 14, '3': 15, '4': 16, '5': 17, '6': 18, '7': 19, '8': 20, '9': 21, ':': 22, ';': 23, '<': 24, '?': 25, 'A': 26, 'B': 27, 'C': 28, 'D': 29, 'E': 30, 'F': 31, 'G': 32, 'H': 33, 'I': 34, 'J': 35, 'K': 36, 'L': 37, 'M': 38, 'N': 39, 'O': 40, 'P': 41, 'Q': 42, 'R': 43, 'S': 44, 'T': 45, 'U': 46, 'V': 47, 'W': 48, 'X': 49, 'Y': 50, 'Z': 51, '[': 52, ']': 53, '_': 54, 'a': 55, 'b': 56, 'c': 57, 'd': 58, 'e': 59, 'f': 60, 'g': 61, 'h': 62, 'i': 63, 'j': 64, 'k': 65, 'l': 66, 'm': 67, 'n': 68, 'o': 69, 'p': 70, 'q': 71, 'r': 72, 's': 73, 't': 74, 'u': 75, 'v': 76, 'w': 77, 'x': 78, 'y': 79, 'z': 80, '}': 81, '\ufeff': 82}

Seems we have some strange characters in the corpus that are very rare and probably due to mistakes. To reduce the length of the tensors we'll later use to represent our chars, it's best to remove them.

TODO: Implement the remove_chars() function in the hw3/charnn.py module.

In [6]:
corpus, n_removed = charnn.remove_chars(corpus, ['}','$','_','<','\ufeff'])
print(f'Removed {n_removed} chars')

# After removing the chars, re-create the mappings
char_to_idx, idx_to_char = charnn.char_maps(corpus)
Removed 34 chars

The next thing we need is an embedding of the characters. An embedding is a representation of each token from the sequence as a tensor. For a char-level RNN, our tokens will be chars, so we can use the simplest possible embedding: encode each char as a one-hot tensor. In other words, each char will be represented as a tensor whose length is the total number of unique chars (V), containing all zeros except at the index corresponding to that specific char.

TODO: Implement the functions chars_to_onehot() and onehot_to_chars() in the hw3/charnn.py module.

In [7]:
# Wrap the actual embedding functions for calling convenience
def embed(text):
    return charnn.chars_to_onehot(text, char_to_idx)

def unembed(embedding):
    return charnn.onehot_to_chars(embedding, idx_to_char)

text_snippet = corpus[3104:3148]
print(text_snippet)
print(embed(text_snippet[0:3]))

test.assertEqual(text_snippet, unembed(embed(text_snippet)))
test.assertEqual(embed(text_snippet).dtype, torch.int8)
brine a maiden can season her praise in.
   
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
         0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0]], dtype=torch.int8)

Dataset Creation¶

We wish to train our model to generate text by constantly predicting what the next char should be based on the past. To that end we'll need to train our recurrent network in a way similar to a classification task. At each timestep, we input a char and set the expected output (label) to be the next char in the original sequence.

We will split our corpus into shorter sequences of length S chars (see question below). Each sample we provide our model with will therefore be a tensor of shape (S,V) where V is the embedding dimension. Our model will operate sequentially on each char in the sequence. For each sample, we'll also need a label. This is simply another sequence, shifted by one char so that the label of each char is the next char in the corpus.
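The sample/label shifting can be illustrated on a toy string (an assumed example, not the required implementation):

```python
# Toy sketch: split a string into samples of seq_len chars; each label
# sequence is the sample shifted one char forward in the text.
text = "abcdefghi"
seq_len = 4
num_samples = (len(text) - 1) // seq_len  # -1 leaves room for the last label

pairs = []
for i in range(num_samples):
    sample = text[i * seq_len : (i + 1) * seq_len]
    label = text[i * seq_len + 1 : (i + 1) * seq_len + 1]
    pairs.append((sample, label))

print(pairs)  # [('abcd', 'bcde'), ('efgh', 'fghi')]
```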

TODO: Implement the chars_to_labelled_samples() function in the hw3/charnn.py module.

In [8]:
# Create dataset of sequences
seq_len = 64
vocab_len = len(char_to_idx)

# Create labelled samples
samples, labels = charnn.chars_to_labelled_samples(corpus, char_to_idx, seq_len, device)
print(f'samples shape: {samples.shape}')
print(f'labels shape: {labels.shape}')

# Test shapes
num_samples = (len(corpus) - 1) // seq_len
test.assertEqual(samples.shape, (num_samples, seq_len, vocab_len))
test.assertEqual(labels.shape, (num_samples, seq_len))

# Test content
for _ in range(1000):
    # random sample
    i = np.random.randint(num_samples, size=(1,))[0]
    # Compare to corpus
    test.assertEqual(unembed(samples[i]), corpus[i*seq_len:(i+1)*seq_len], msg=f"content mismatch in sample {i}")
    # Compare to labels
    sample_text = unembed(samples[i])
    label_text = str.join('', [idx_to_char[j.item()] for j in labels[i]])
    test.assertEqual(sample_text[1:], label_text[0:-1], msg=f"label mismatch in sample {i}")
samples shape: torch.Size([99182, 64, 78])
labels shape: torch.Size([99182, 64])

Let's print a few consecutive samples. You should see that the text continues between them.

In [9]:
import re
import random

i = random.randrange(num_samples-5)
for i in range(i, i+5):
    test.assertEqual(len(samples[i]), seq_len)
    s = re.sub(r'\s+', ' ', unembed(samples[i])).strip()
    print(f'sample [{i}]:\n\t{s}')
sample [16986]:
	you, coz, Of this young Percy's pride? The prisoners Wh
sample [16987]:
	ich he in this adventure hath surpris'd To his own use he ke
sample [16988]:
	eps, and sends me word I shall have none but Mordake Earl of
sample [16989]:
	Fife. West. This is his uncle's teaching, this Worcester,
sample [16990]:
	Malevolent to you In all aspects, Which makes him prune h

As usual, instead of feeding one sample at a time into our model's forward we'll work with batches of samples. This means that at every timestep, our model will operate on a batch of chars that come from different sequences. Effectively this allows us to parallelize training by doing matrix-matrix multiplications instead of matrix-vector multiplications during the forward pass.

An important nuance is that we need the batches to be contiguous, i.e. sample $k$ in batch $j$ should continue sample $k$ from batch $j-1$. The following figure illustrates this:

If we naïvely take consecutive samples into batches, e.g. [0,1,...,B-1], [B,B+1,...,2B-1] and so on, we won't have contiguous sequences at the same index between adjacent batches.

To accomplish this we need to tell our DataLoader which samples to combine together into one batch. We do this by implementing a custom PyTorch Sampler, and providing it to our DataLoader.
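One possible interleaving scheme (a sketch under assumptions, not necessarily the required SequenceBatchSampler logic) places index j + k*M at position k of batch j, where M is the number of full batches:

```python
# Sketch: with N samples and batch size B there are M = N // B full batches;
# placing index j + k*M at position k of batch j makes sample k of batch j+1
# continue sample k of batch j in the corpus.
def contiguous_indices(n_samples, batch_size):
    n_batches = n_samples // batch_size
    idx = []
    for j in range(n_batches):        # batch number
        for k in range(batch_size):   # position within the batch
            idx.append(j + k * n_batches)
    return idx

print(contiguous_indices(8, 2))  # [0, 4, 1, 5, 2, 6, 3, 7]
```

Batches of size 2 taken in order from this list are [0, 4], [1, 5], [2, 6], [3, 7]: at each position, consecutive batches hold consecutive corpus samples.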

TODO: Implement the SequenceBatchSampler class in the hw3/charnn.py module.

In [10]:
from hw3.charnn import SequenceBatchSampler

sampler = SequenceBatchSampler(dataset=range(32), batch_size=10)
sampler_idx = list(sampler)
print('sampler_idx =\n', sampler_idx)

# Test the Sampler
test.assertEqual(len(sampler_idx), 30)
batch_idx = np.array(sampler_idx).reshape(-1, 10)
for k in range(10):
    test.assertEqual(np.diff(batch_idx[:, k], n=2).item(), 0)
sampler_idx =
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]

Even though we're working with sequences, we can still use the standard PyTorch Dataset/DataLoader combo. For the dataset we can use a built-in class, TensorDataset to return tuples of (sample, label) from the samples and labels tensors we created above. The DataLoader will be provided with our custom Sampler so that it generates appropriate batches.

In [11]:
import torch.utils.data

# Create DataLoader returning batches of samples.
batch_size = 32

ds_corpus = torch.utils.data.TensorDataset(samples, labels)
sampler_corpus = SequenceBatchSampler(ds_corpus, batch_size)
dl_corpus = torch.utils.data.DataLoader(ds_corpus, batch_size=batch_size, sampler=sampler_corpus, shuffle=False)

Let's see what that gives us:

In [12]:
print(f'num batches: {len(dl_corpus)}')

x0, y0 = next(iter(dl_corpus))
print(f'shape of a batch of samples: {x0.shape}')
print(f'shape of a batch of labels: {y0.shape}')
num batches: 3100
shape of a batch of samples: torch.Size([32, 64, 78])
shape of a batch of labels: torch.Size([32, 64])

Now let's look at the same sample index across multiple batches taken from our corpus.

In [13]:
# Check that sentences at the same index of different batches continue each other.
k = random.randrange(batch_size)
for j, (X, y) in enumerate(dl_corpus):
    print(f'=== batch {j}, sample {k} ({X[k].shape}): ===')
    s = re.sub(r'\s+', ' ', unembed(X[k])).strip()
    print(f'\t{s}')
    if j==4: break
=== batch 0, sample 19 (torch.Size([64, 78])): ===
	good must of necessity hold his virtue to you, whose worthin
=== batch 1, sample 19 (torch.Size([64, 78])): ===
	om her cheek. No more of this, Helena; go to, no more, lest
=== batch 2, sample 19 (torch.Size([64, 78])): ===
	ion Must die for love. 'Twas pretty, though a plague, To
=== batch 3, sample 19 (torch.Size([64, 78])): ===
	ttle can be said in 't; 'tis against the rule of nature. To
=== batch 4, sample 19 (torch.Size([64, 78])): ===
	he shall. God send him well! The court's a learning-place,

Model Implementation¶

Finally, our dataset is ready, so we can focus on our model.

We'll implement a multilayer gated recurrent unit (GRU) model with dropout. This model is a type of RNN which performs similarly to the well-known LSTM model, but is somewhat easier to train because it has fewer parameters. We'll modify the regular GRU slightly by applying dropout to the hidden states passed between layers of the model.

The model accepts an input $\mat{X}\in\set{R}^{S\times V}$ containing a sequence of embedded chars. It returns an output $\mat{Y}\in\set{R}^{S\times V}$ of predictions for the next char and the final hidden state $\mat{H}\in\set{R}^{L\times H}$. Here $S$ is the sequence length, $V$ is the vocabulary size (number of unique chars), $L$ is the number of layers in the model and $H$ is the hidden dimension.

Mathematically, the model's forward function at layer $k\in[1,L]$ and timestep $t\in[1,S]$ can be described as

$$ \begin{align} \vec{z_t}^{[k]} &= \sigma\left(\vec{x}^{[k]}_t {\mattr{W}_{\mathrm{xz}}}^{[k]} + \vec{h}_{t-1}^{[k]} {\mattr{W}_{\mathrm{hz}}}^{[k]} + \vec{b}_{\mathrm{z}}^{[k]}\right) \\ \vec{r_t}^{[k]} &= \sigma\left(\vec{x}^{[k]}_t {\mattr{W}_{\mathrm{xr}}}^{[k]} + \vec{h}_{t-1}^{[k]} {\mattr{W}_{\mathrm{hr}}}^{[k]} + \vec{b}_{\mathrm{r}}^{[k]}\right) \\ \vec{g_t}^{[k]} &= \tanh\left(\vec{x}^{[k]}_t {\mattr{W}_{\mathrm{xg}}}^{[k]} + (\vec{r_t}^{[k]}\odot\vec{h}_{t-1}^{[k]}) {\mattr{W}_{\mathrm{hg}}}^{[k]} + \vec{b}_{\mathrm{g}}^{[k]}\right) \\ \vec{h_t}^{[k]} &= \vec{z}^{[k]}_t \odot \vec{h}^{[k]}_{t-1} + \left(1-\vec{z}^{[k]}_t\right)\odot \vec{g_t}^{[k]} \end{align} $$

The input to each layer is, $$ \mat{X}^{[k]} = \begin{bmatrix} {\vec{x}_1}^{[k]} \\ \vdots \\ {\vec{x}_S}^{[k]} \end{bmatrix} = \begin{cases} \mat{X} & \mathrm{if} ~k = 1~ \\ \mathrm{dropout}_p \left( \begin{bmatrix} {\vec{h}_1}^{[k-1]} \\ \vdots \\ {\vec{h}_S}^{[k-1]} \end{bmatrix} \right) & \mathrm{if} ~1 < k \leq L+1~ \end{cases}. $$

The output of the entire model is then, $$ \mat{Y} = \mat{X}^{[L+1]} {\mattr{W}_{\mathrm{hy}}} + \mat{B}_{\mathrm{y}} $$

and the final hidden state is $$ \mat{H} = \begin{bmatrix} {\vec{h}_S}^{[1]} \\ \vdots \\ {\vec{h}_S}^{[L]} \end{bmatrix}. $$

Notes:

  • $t\in[1,S]$ is the timestep, i.e. the current position within the sequence of each sample.
  • $\vec{x}_t^{[k]}$ is the input of layer $k$ at timestep $t$.
  • The outputs of the last layer, $\vec{y}_t^{[L]}$, are the predicted next-char scores for every input char. These are similar to class scores in classification tasks.
  • The hidden states at the last timestep, $\vec{h}_S^{[k]}$, are the final hidden state returned from the model.
  • $\sigma(\cdot)$ is the sigmoid function, i.e. $\sigma(\vec{z}) = 1/(1+e^{-\vec{z}})$ which returns values in $(0,1)$.
  • $\tanh(\cdot)$ is the hyperbolic tangent, i.e. $\tanh(\vec{z}) = (e^{2\vec{z}}-1)/(e^{2\vec{z}}+1)$ which returns values in $(-1,1)$.
  • $\vec{h_t}^{[k]}$ is the hidden state of layer $k$ at time $t$. This can be thought of as the memory of that layer.
  • $\vec{g_t}^{[k]}$ is the candidate hidden state for time $t$.
  • $\vec{z_t}^{[k]}$ is known as the update gate. It combines the previous state with the input to determine how much the current state will be combined with the new candidate state. For example, if $\vec{z_t}^{[k]}=\vec{1}$ then the current input has no effect on the output.
  • $\vec{r_t}^{[k]}$ is known as the reset gate. It combines the previous state with the input to determine how much of the previous state will affect the current state candidate. For example if $\vec{r_t}^{[k]}=\vec{0}$ the previous state has no effect on the current candidate state.

Here's a graphical representation of the GRU's forward pass at each timestep. The $\vec{\tilde{h}}$ in the image is our $\vec{g}$ (candidate next state).

You can see how the reset and update gates allow the model to completely ignore its previous state, completely ignore its input, or realize any mixture of those behaviors (since the gates are continuous, with values in $(0,1)$).

Here's a graphical representation of the entire model. You can ignore the $c_t^{[k]}$ (cell state) variables (which are relevant for LSTM models). Our model has only the hidden state, $h_t^{[k]}$. Also notice that we added dropout between layers (i.e., on the up arrows).

The purple tensors are inputs (a sequence and initial hidden state per layer), and the green tensors are outputs (another sequence and final hidden state per layer). Each blue block implements the above forward equations. Blocks that are on the same vertical level are at the same layer, and therefore share parameters.
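To make the equations concrete, here is a minimal single-timestep GRU update on plain tensors. This is only an illustrative sketch: all names are hypothetical, and the weights are stored untransposed (shape input-dim by hidden-dim) so no transpose appears.

```python
import torch

torch.manual_seed(0)
B, V, H = 4, 10, 8  # batch size, input (vocab) dim, hidden dim

# Hypothetical parameters; shapes chosen so that x @ W works directly.
Wxz, Whz, bz = torch.randn(V, H), torch.randn(H, H), torch.zeros(H)
Wxr, Whr, br = torch.randn(V, H), torch.randn(H, H), torch.zeros(H)
Wxg, Whg, bg = torch.randn(V, H), torch.randn(H, H), torch.zeros(H)

x_t = torch.randn(B, V)     # embedded input chars at timestep t
h_prev = torch.zeros(B, H)  # hidden state from timestep t-1

z = torch.sigmoid(x_t @ Wxz + h_prev @ Whz + bz)     # update gate
r = torch.sigmoid(x_t @ Wxr + h_prev @ Whr + br)     # reset gate
g = torch.tanh(x_t @ Wxg + (r * h_prev) @ Whg + bg)  # candidate state
h_t = z * h_prev + (1 - z) * g                       # new hidden state

print(h_t.shape)  # torch.Size([4, 8])
```

The actual MultilayerGRU stacks L such cells, applies dropout to the hidden states passed up between layers, and adds the final output projection.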

TODO: Implement the MultilayerGRU class in the hw3/charnn.py module.

Notes:

  • You'll need to handle input batches now. The math is identical to the above, but all the tensors will have an extra batch dimension as their first dimension.
  • Use the diagram above to help guide your implementation. It will help you visualize what shapes to return where, etc.
In [14]:
in_dim = vocab_len
h_dim = 256
n_layers = 3
model = charnn.MultilayerGRU(in_dim, h_dim, out_dim=in_dim, n_layers=n_layers)
model = model.to(device)
print(model)

# Test forward pass
y, h = model(x0.to(dtype=torch.float, device=device))
print(f'y.shape={y.shape}')
print(f'h.shape={h.shape}')

test.assertEqual(y.shape, (batch_size, seq_len, vocab_len))
test.assertEqual(h.shape, (batch_size, n_layers, h_dim))
test.assertEqual(len(list(model.parameters())), 9 * n_layers + 2) 
MultilayerGRU(
  (zx_0): Linear(in_features=78, out_features=256, bias=False)
  (zh_0): Linear(in_features=256, out_features=256, bias=True)
  (rx_0): Linear(in_features=78, out_features=256, bias=False)
  (rh_0): Linear(in_features=256, out_features=256, bias=True)
  (gx_0): Linear(in_features=78, out_features=256, bias=False)
  (gh_0): Linear(in_features=256, out_features=256, bias=True)
  (dropout_0): Dropout(p=0, inplace=False)
  (zx_1): Linear(in_features=256, out_features=256, bias=False)
  (zh_1): Linear(in_features=256, out_features=256, bias=True)
  (rx_1): Linear(in_features=256, out_features=256, bias=False)
  (rh_1): Linear(in_features=256, out_features=256, bias=True)
  (gx_1): Linear(in_features=256, out_features=256, bias=False)
  (gh_1): Linear(in_features=256, out_features=256, bias=True)
  (dropout_1): Dropout(p=0, inplace=False)
  (zx_2): Linear(in_features=256, out_features=256, bias=False)
  (zh_2): Linear(in_features=256, out_features=256, bias=True)
  (rx_2): Linear(in_features=256, out_features=256, bias=False)
  (rh_2): Linear(in_features=256, out_features=256, bias=True)
  (gx_2): Linear(in_features=256, out_features=256, bias=False)
  (gh_2): Linear(in_features=256, out_features=256, bias=True)
  (dropout_2): Dropout(p=0, inplace=False)
  (output_layer): Linear(in_features=256, out_features=78, bias=True)
)
y.shape=torch.Size([32, 64, 78])
h.shape=torch.Size([32, 3, 256])

Generating text by sampling¶

Now that we have a model, we can implement text generation based on it. The idea is simple: At each timestep our model receives one char $x_t$ from the input sequence and outputs scores $y_t$ for what the next char should be. We'll convert these scores into a probability over each of the possible chars. In other words, for each input char $x_t$ we create a probability distribution for the next char conditioned on the current one and the state of the model (representing all previous inputs): $$p(x_{t+1}|x_t, \vec{h}_t).$$

Once we have such a distribution, we'll sample a char from it. This will be the first char of our generated sequence. Now we can feed this new char into the model, create another distribution, sample the next char and so on. Note that it's crucial to propagate the hidden state when sampling.

The important point, however, is how to create the distribution from the scores. One way, as we saw in previous ML tasks, is to use the softmax function. However, a drawback of softmax is that it can produce very diffuse (more uniform) distributions when the score values are similar. When sampling, we would prefer to control the distribution and make it less uniform, to increase the chance of sampling the char(s) with the highest scores compared to the others.

To control the variance of the distribution, a common trick is to add a hyperparameter $T$, known as the temperature, to the softmax function. The class scores are simply divided by $T$ before softmax is applied: $$ \mathrm{softmax}_T(\vec{y}) = \frac{e^{\vec{y}/T}}{\sum_k e^{y_k/T}} $$

A low $T$ will result in less uniform distributions and vice-versa.
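A minimal sketch of a temperature-scaled softmax and of sampling from it (the actual hot_softmax() signature may differ):

```python
import torch

# Dividing scores by T before softmax: T < 1 sharpens the distribution,
# T > 1 flattens it towards uniform.
def softmax_t(y, temperature=1.0):
    return torch.softmax(y / temperature, dim=0)

scores = torch.tensor([1.0, 2.0, 3.0])
sharp = softmax_t(scores, temperature=0.5)
flat = softmax_t(scores, temperature=100.0)
print(sharp)  # most of the mass on the highest score
print(flat)   # nearly uniform

# At generation time we'd sample a char index from such a distribution:
idx = torch.multinomial(sharp, num_samples=1).item()
```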

TODO: Implement the hot_softmax() function in the hw3/charnn.py module.

In [15]:
scores = y[0,0,:].detach()
_, ax = plt.subplots(figsize=(15,5))

for t in reversed([0.3, 0.5, 1.0, 100]):
    ax.plot(charnn.hot_softmax(scores, temperature=t).cpu().numpy(), label=f'T={t}')
ax.set_xlabel('$x_{t+1}$')
ax.set_ylabel('$p(x_{t+1}|x_t)$')
ax.legend()

uniform_proba = 1/len(char_to_idx)
uniform_diff = torch.abs(charnn.hot_softmax(scores, temperature=100) - uniform_proba)
test.assertTrue(torch.all(uniform_diff < 1e-4))

TODO: Implement the generate_from_model() function in the hw3/charnn.py module.

In [16]:
for _ in range(3):
    text = charnn.generate_from_model(model, "foobar", 50, (char_to_idx, idx_to_char), T=0.5)
    print(text)
    test.assertEqual(len(text), 50)
foobarLAM
?Bs[07vtE EG-'."
LJ-t-1b5LLBU"[q-KCPEw?u
foobar8!dO4gN8u9Xmb5-SvJPhx1dC3z,jCEAX)QI
,RwD,M2M
foobarG&,vsZ;IRhLD]0y1n.)H4[",TEfi6j1Pg'yXIo7y-f(x

Training¶

To train this model, we'll calculate the loss at each timestep by comparing the predicted char to the actual char from our label. We can use cross-entropy, since per char it's similar to a classification problem. We'll then sum the losses over the sequence and back-propagate the gradients through time. Notice that the back-propagation algorithm will "visit" each layer's parameter tensors multiple times, so gradients will accumulate in the parameters of the blocks. Luckily, autograd will handle this part for us.
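The per-timestep loss can be sketched by flattening the batch and sequence dimensions, so that nn.CrossEntropyLoss treats every timestep as one classification sample (an assumed shape convention, not necessarily how RNNTrainer must do it):

```python
import torch
import torch.nn as nn

B, S, V = 2, 5, 7                     # hypothetical batch, sequence, vocab sizes
scores = torch.randn(B, S, V)         # model outputs y_t for each timestep
labels = torch.randint(0, V, (B, S))  # next-char indices (the labels)

loss_fn = nn.CrossEntropyLoss()
# Flatten so every (batch, timestep) pair is one classification sample.
loss = loss_fn(scores.reshape(B * S, V), labels.reshape(B * S))
print(loss.item())  # a scalar, averaged over all B*S timesteps
```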

As usual, the first step of training will be to try and overfit a large model (many parameters) to a tiny dataset. Again, this is to ensure the model and training code are implemented correctly, i.e. that the model can learn.

For a generative model such as this, overfitting is slightly trickier than for classification. What we'll aim to do is to get our model to memorize a specific sequence of chars, so that when given the first char in the sequence it will immediately spit out the rest of the sequence verbatim.

Let's create a tiny dataset to memorize.

In [17]:
# Pick a tiny subset of the dataset
subset_start, subset_end = 1001, 1005
ds_corpus_ss = torch.utils.data.Subset(ds_corpus, range(subset_start, subset_end))
batch_size_ss = 1
sampler_ss = SequenceBatchSampler(ds_corpus_ss, batch_size=batch_size_ss)
dl_corpus_ss = torch.utils.data.DataLoader(ds_corpus_ss, batch_size_ss, sampler=sampler_ss, shuffle=False)

# Convert subset to text
subset_text = ''
for i in range(subset_end - subset_start):
    subset_text += unembed(ds_corpus_ss[i][0])
print(f'Text to "memorize":\n\n{subset_text}')
Text to "memorize":

TRAM. What would you have?
  HELENA. Something; and scarce so much; nothing, indeed.
    I would not tell you what I would, my lord.
    Faith, yes:
    Strangers and foes do sunder and not kiss.
  BERTRAM. I pray you, stay not, but in haste to horse.
  HE

Now let's implement the first part of our training code.

TODO: Implement the train_epoch() and train_batch() methods of the RNNTrainer class in the hw3/training.py module. You must think about how to correctly handle the hidden state of the model between batches and epochs for this specific task (i.e. text generation).

In [18]:
import torch.nn as nn
import torch.optim as optim
from hw3.training import RNNTrainer

torch.manual_seed(42)

lr = 0.01
num_epochs = 500

in_dim = vocab_len
h_dim = 128
n_layers = 2
loss_fn = nn.CrossEntropyLoss()
model = charnn.MultilayerGRU(in_dim, h_dim, out_dim=in_dim, n_layers=n_layers).to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
trainer = RNNTrainer(model, loss_fn, optimizer, device)

for epoch in range(num_epochs):
    epoch_result = trainer.train_epoch(dl_corpus_ss, verbose=False)
    
    # Every X epochs, we'll generate a sequence starting from the first char in the first sequence
    # to visualize how/if/what the model is learning.
    if epoch == 0 or (epoch+1) % 25 == 0:
        avg_loss = np.mean(epoch_result.losses)
        accuracy = np.mean(epoch_result.accuracy)
        print(f'\nEpoch #{epoch+1}: Avg. loss = {avg_loss:.3f}, Accuracy = {accuracy:.2f}%')
        
        generated_sequence = charnn.generate_from_model(model, subset_text[0],
                                                        seq_len*(subset_end-subset_start),
                                                        (char_to_idx,idx_to_char), T=0.1)
        
        # Stop if we've successfully memorized the small dataset.
        print(generated_sequence)
        if generated_sequence == subset_text:
            break

# Test successful overfitting
test.assertGreater(epoch_result.accuracy, 99)
test.assertEqual(generated_sequence, subset_text)
Epoch #1: Avg. loss = 3.940, Accuracy = 17.58%
Tn                                                                                                                                                                                                                                       o                      

Epoch #25: Avg. loss = 0.272, Accuracy = 96.09%
TRAM. What would you have?
  HELENA. Something; and scarce so much; nothing, indeed.
    I would not indeed.
    I would not teld and not indeed.
    I would not indeed.
    I would not indeed.
    Faith, yes:
    Faith, yes:
    Faith, yes:
    Faith, yes

Epoch #50: Avg. loss = 0.008, Accuracy = 100.00%
TRAM. What would you have?
  HELENA. Something; and scarce so much; nothing, indeed.
    I would not tell you what I would, my lord.
    Faith, yes:
    Strangers and foes do sunder and not kiss.
  BERTRAM. I pray you, stay not, but in haste to horse.
  HE

OK, so training works - we can memorize a short sequence. We'll now train a much larger model on our large dataset. You'll need a GPU for this part.

First, let's set up our dataset and model for training. We'll split our corpus into a 90% training set and a 10% test set. We'll also use a learning-rate scheduler to control the learning rate during training.

TODO: Set the hyperparameters in the part1_rnn_hyperparams() function of the hw3/answers.py module.

In [19]:
from hw3.answers import part1_rnn_hyperparams

hp = part1_rnn_hyperparams()
print('hyperparams:\n', hp)

### Dataset definition
vocab_len = len(char_to_idx)
batch_size = hp['batch_size']
seq_len = hp['seq_len']
train_test_ratio = 0.9
num_samples = (len(corpus) - 1) // seq_len
num_train = int(train_test_ratio * num_samples)

samples, labels = charnn.chars_to_labelled_samples(corpus, char_to_idx, seq_len, device)

ds_train = torch.utils.data.TensorDataset(samples[:num_train], labels[:num_train])
sampler_train = SequenceBatchSampler(ds_train, batch_size)
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False, sampler=sampler_train, drop_last=True)

ds_test = torch.utils.data.TensorDataset(samples[num_train:], labels[num_train:])
sampler_test = SequenceBatchSampler(ds_test, batch_size)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size, shuffle=False, sampler=sampler_test, drop_last=True)

print(f'Train: {len(dl_train):3d} batches, {len(dl_train)*batch_size*seq_len:7d} chars')
print(f'Test:  {len(dl_test):3d} batches, {len(dl_test)*batch_size*seq_len:7d} chars')

### Training definition
in_dim = out_dim = vocab_len
checkpoint_file = 'checkpoints/rnn'
num_epochs = 50
early_stopping = 5

model = charnn.MultilayerGRU(in_dim, hp['h_dim'], out_dim, hp['n_layers'], hp['dropout'])
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=hp['learn_rate'])
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=hp['lr_sched_factor'], patience=hp['lr_sched_patience'], verbose=True
)
trainer = RNNTrainer(model, loss_fn, optimizer, device)
hyperparams:
 {'batch_size': 128, 'seq_len': 128, 'h_dim': 256, 'n_layers': 2, 'dropout': 0.1, 'learn_rate': 0.0005, 'lr_sched_factor': 0.07, 'lr_sched_patience': 0.7}
Train: 348 batches, 5701632 chars
Test:   38 batches,  622592 chars

The code blocks below will train the model and save checkpoints containing the training state and the best model parameters to a file. This allows you to stop training and resume it later from where you left off.

Note that you can use the main.py script provided within the assignment folder to run this notebook from the command line as if it were a python script by using the run-nb subcommand. This allows you to train your model using this notebook without starting jupyter. You can combine this with srun or sbatch to run the notebook with a GPU on the course servers.

TODO:

  • Implement the fit() method of the Trainer class. You can reuse the relevant implementation parts from HW2, but make sure to implement early stopping and checkpoints.
  • Implement the test_epoch() and test_batch() methods of the RNNTrainer class in the hw3/training.py module.
  • Run the following block to train.
  • When training is done and you're satisfied with the model's outputs, rename the checkpoint file to checkpoints/rnn_final.pt. This will cause the block to skip training and instead load your saved model when running the homework submission script. Note that your submission zip file will not include the checkpoint file. This is OK.
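The checkpoint logic inside `fit()` might look roughly like the following sketch. The saved fields (`best_acc`, `ewi`) and the helper names are illustrative, not a required format:

```python
import torch
import torch.nn as nn

def save_checkpoint(path, model, best_acc, epochs_without_improvement):
    # Persist everything needed to resume training later (illustrative fields)
    torch.save({
        'model_state': model.state_dict(),
        'best_acc': best_acc,
        'ewi': epochs_without_improvement,
    }, path)

def load_checkpoint(path, model, device='cpu'):
    saved = torch.load(path, map_location=device)
    model.load_state_dict(saved['model_state'])
    return saved['best_acc'], saved['ewi']

model = nn.Linear(4, 2)
save_checkpoint('/tmp/demo_ckpt.pt', model, best_acc=57.9, epochs_without_improvement=1)
best_acc, ewi = load_checkpoint('/tmp/demo_ckpt.pt', nn.Linear(4, 2))
print(best_acc, ewi)
```

In `fit()` you would call `save_checkpoint` whenever the test accuracy improves, and restore the saved state at the start if the file already exists.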
In [20]:
from cs236781.plot import plot_fit

def post_epoch_fn(epoch, train_res, test_res, verbose):
    # Update learning rate
    scheduler.step(test_res.accuracy)
    # Sample from model to show progress
    if verbose:
        start_seq = "ACT I."
        generated_sequence = charnn.generate_from_model(
            model, start_seq, 100, (char_to_idx,idx_to_char), T=0.5
        )
        print(generated_sequence)

# Train, unless final checkpoint is found
checkpoint_file_final = f'{checkpoint_file}_final.pt'
if os.path.isfile(checkpoint_file_final):
    print(f'*** Loading final checkpoint file {checkpoint_file_final} instead of training')
    saved_state = torch.load(checkpoint_file_final, map_location=device)
    model.load_state_dict(saved_state['model_state'])
else:
    try:
        # Print pre-training sampling
        print(charnn.generate_from_model(model, "ACT I.", 100, (char_to_idx,idx_to_char), T=0.5))

        fit_res = trainer.fit(dl_train, dl_test, num_epochs, max_batches=None,
                              post_epoch_fn=post_epoch_fn, early_stopping=early_stopping,
                              checkpoints=checkpoint_file, print_every=1)
        
        fig, axes = plot_fit(fit_res)
    except KeyboardInterrupt as e:
        print('\n *** Training interrupted by user')
ACT I.&7UQ?P76Ltq,JbWO'K1:wnR)5XWK2"K7wT6eToffgZq8n0zA2jweTiWf??'d95.7f(ELP606Q
[ubaETa.5eTT
isQe?N(
*** Loading checkpoint file checkpoints/rnn.pt
--- EPOCH 1/50 ---
train_batch (Avg. Loss 1.144, Accuracy 64.8): 100%|██████████| 348/348 [06:54<00:00,  1.19s/it]
test_batch (Avg. Loss 1.354, Accuracy 57.9): 100%|██████████| 38/38 [00:12<00:00,  3.14it/s]
ACT I.
Where is the matter now goes to my soul,
And sent thee strawged of death.

KING RICHARD II:
A
--- EPOCH 2/50 ---
train_batch (Avg. Loss 1.143, Accuracy 64.9): 100%|██████████| 348/348 [04:59<00:00,  1.16it/s]
test_batch (Avg. Loss 1.354, Accuracy 57.9): 100%|██████████| 38/38 [00:09<00:00,  3.89it/s]
Epoch     2: reducing learning rate of group 0 to 3.5000e-05.
ACT I.
The conscience therein the way to stay
To stay at the state of war in this feet
With an apple
--- EPOCH 3/50 ---
train_batch (Avg. Loss 1.152, Accuracy 64.6): 100%|██████████| 348/348 [05:46<00:00,  1.01it/s]
test_batch (Avg. Loss 1.364, Accuracy 57.8): 100%|██████████| 38/38 [00:15<00:00,  2.41it/s]
Epoch     3: reducing learning rate of group 0 to 2.4500e-06.
ACT I. SCENE I.
Verona, prince of the sea, and the more the process of the time
  That they shall fi
--- EPOCH 4/50 ---
train_batch (Avg. Loss 1.147, Accuracy 64.8): 100%|██████████| 348/348 [05:17<00:00,  1.10it/s]
test_batch (Avg. Loss 1.368, Accuracy 57.7): 100%|██████████| 38/38 [00:10<00:00,  3.63it/s]
Epoch     4: reducing learning rate of group 0 to 1.7150e-07.
ACT I. SCENE I.
A harm speak of the streets, the blood is dead.
                                    
--- EPOCH 5/50 ---
train_batch (Avg. Loss 1.144, Accuracy 64.8): 100%|██████████| 348/348 [07:28<00:00,  1.29s/it]
test_batch (Avg. Loss 1.368, Accuracy 57.7): 100%|██████████| 38/38 [00:12<00:00,  3.05it/s]
Epoch     5: reducing learning rate of group 0 to 1.2005e-08.
ACT I.
Well gone, and therein know of the Prince and Saint Albans,
Of the man will be the very hones
--- EPOCH 6/50 ---
train_batch (Avg. Loss 1.143, Accuracy 64.8): 100%|██████████| 348/348 [04:59<00:00,  1.16it/s]
test_batch (Avg. Loss 1.368, Accuracy 57.7): 100%|██████████| 38/38 [00:10<00:00,  3.73it/s]

Generating a work of art¶

Armed with our fully trained model, let's generate the next Hamlet! You should experiment with modifying the sampling temperature and see what happens.

The text you generate should “look” like a Shakespeare play: old-style English words and sentence structure, directions for the actors (like “Exit/Enter”), sections (Act I/Scene III) etc. There will be no coherent plot of course, but it should at least seem like a Shakespearean play when not looking too closely. If this is not what you see, go back, debug, and/or re-train.

TODO: Specify the generation parameters in the part1_generation_params() function within the hw3/answers.py module.

In [21]:
from hw3.answers import part1_generation_params

start_seq, temperature = part1_generation_params()
print(start_seq)
generated_sequence = charnn.generate_from_model(
    model, start_seq, 10000, (char_to_idx,idx_to_char), T=temperature
)

print(generated_sequence)
ACT I.
ACT I. Scarce, help men!
  TLOUBAN. No, my lords, come and a slave with her;
    That what to drink my mout Pompey! Fin?
  VINSINIA. Be cites art thou not bereft a lin once as harm'd.
  LUCENTIO. Nothing, no, no better; 'our pitch is he,
    To serve your labour, but he will convey his own feeping liege,
     To bear with fancy to ran't.
  AARON. Thou art unconstruction of her wife. Let me see,
    Your most rich weeds of the Roman of thy master-horse!
  BUCKINGHAM. The hoom as your dispositions of the head. Will you so?
    In dangers, side,  
    And Cassio despairs from winds, and to't, and that we are left?
  HORTENSIO. And there's praised of a book in my heart!
  SPEED. Thou hast mock'd with a charge, my dog discourse
    That she was poor love.

                        Enter TROILUS


COMINIUS:
You speak moeseld sorrow. Adieu
Ophe will revenge my word. O banish'd Caesar will die,
The palen of Menetalinges pardon me. I,
  Thou hadst a time to steal away, to hard heels
Amongst one; puse, we are valiant.

KING RICHARD III:

COMINIUS:
Alack, when evermore
Be gone, to a sin'd guard
That Richard hope and negligence,
That now of parts and penalty hath put your princely man.] The gods
Regist with guilty it to the guard
That with this day where I vill give him he!
My Lord Colours.

First Caesar.

Enter a MESSENGER

  MACDEIO. O that I am, you did do with me. I will a poor private fools
    As almost muddings victory, and like a whore
    To stir men's names on Richmond thus? But if you do an elder!
    The banish'd soul to want commands,
    Be it at any codio. Will you have a thing?
  VIOLA. It choose here with all the offences for every
    monster, and do me base mend.
  PANDARUS. Pause hence that he wears the dog, which they spake down the stocking dream
    at his wings.                               Exit CADESUS

    How have you stand better plants,
    They say, my gracious lord? What must be goat, and let thy leave of her,
    To help the sparrow that help me so. Come, come with him.
    But my ancient paper deaths have your heart
    And after him in every arms as in away to put it of a custom, and my house?
  PATROCLUS. Sir, I have sab thy labour.
  POUT. Where are visions?
    Had he not horselies in your children here?
    The feast of such a gracious other way, it was my wischpless Scot;
    For we cannot come, go in.
  FIRST WITCH. CLEFPEIN, HACTIO, thus thou misshamed them both:
There lifts shall we will tear this woman's tongue of it, and the town,
Which were we like your ladyship's queen!

KING RICHARD II:
Why brothers, I do not ask gentleman.

COMINIUS:
Here, shall running you all
I have a man of my company,
There is towards it every and forged;
There at Posthumus, plotmen why I swore in this directitus of men, I fearful: therefore,
And wouldst have leave to penise to death,
And delay the eye in this
They bid me no person of my worth;
For now I seept a shows offt me.

BRUTUS:
Stay, take him to my house:
The gods show'd descent. Which of these
    heads and weaker with his will, were you tear thy head,
    Still have consul, which never have the funeral rivers heard
    That no drive him that feed I have to them my word.
                                                [Mark him any good and the walls of Buckingham.

KING RICHARD III:
Where let the obey'd wisding for sailor
Unto them open the crown.

KING RICHARD II:
Madam, nor did you cozen us
Shall do amy hope, and in this eye- but, where
And many several kind of blanch: our housesies
Unto the shadow of your help of howls.
If thou beest half so mad to you.

DUCHESS OF MIRHAMAN
    He has done me. Kate Caesar! I am denote none,
    And say he your husping attempts To the chanting commander of the body's
    he seem true to you.  
  OTHELLO. Where is my hairshop, and be
    fair, in such cures.
                                                          Exit I
     The King is better for the proportions to a day of gold griefs;
    He would not snatch a comfort eye.
    For thou shalt often do, Nemiam that he would
    stood and done to the colour, as you know our election
    Comes the trade, sister in articles,
    It is, false but once with your deceit and night.
    This rare bowels place your advanc'd for a brooking part,
    I am his felles of revenges; be here in my power,
    As I, ele convenient knowledge to a tender story,
    And put us to appear to this dream. [Gaithing]
    Dishonour'd all the higheating of the seated.
  Leon. O that he will! A most death'd the venture of the night,
    And Highness grieves shall could give the murderer.
  PATROCLUS. What's in my life and fire, that all denies' master,
    Talk with smoking comfort, and yet not hurtly, and his safety.
     Regent to thy
    uncle, sir, we will appear this creature that is but these
beeps that bepose to come with leasing upon our feeds
    Procuable dost the pend thee, and there to hear it,
    To appear with rights on the care into their luckient,
    I hence their arms was water I might adventure more
    out of death, and men, I'll not put it on your father,
    As in the rascal than the pit.
    Look'd ports that had show'd the field?
  CONSTANCE. If thou hadst wooer, love; bring him, and is it not.
  PROTEUS. On thy pride?
    She has 'em self-boy,
    That would be deadly surfeited should be
    Shall pay the street till you see his pleasure in his will;
    He's rich. Hath put me a conjunct o' the passion wretched?
  SOOTHMAN. No, not-
  POET. What bears are infest's fortune; would they had wish to you.
    What taxts at thy state, I think, ruke. That's the matter;
    He spoke my life, in other men are gone.
  CRESSIDA. Good matter is much good art thou so loudy obedal of it?
  THURIO. Well, leave to bunk in Bardolph, go here, dissembled men,
    And would not look upon your head
    That fall on him drawms a mean as thrice mune
    nigh.
  PASSANIA. Ay, my brothers, as I told you, such an aspect,
    Our spite most exact no less;
    To pay them thee, but to persevered me
    And move your Grace will not hero. Let's rather till then.
  LAUNCE. This day lanker, they should not speak on it.
  THIRD LORD. What is the note? And for her tongue hath sent I heard
    For two his pays of years in Leonato's blood.
  GRUMIO. I thank my young gentlemen with in a little budge,
    And smooth a ll still serve threat'ning
    Whein that green labour to my eyes that wouldst die indeed,
    In silence is a heart to wear any proportion of thy boldly;
    And therefore clear detraction of his incensed matter straight,
    Henry husbands.
  CLOWN. If I do luguard standing aside;
    Who can enrog, I will undo on the enemy's vale,
    And want the humble- a thousand man's will proper them?
  PETRUCHIO. Good night, servile that.
  SPEED. I would he so die home but it is as pale in fear.
    What! I come, and not like thy physician
    Do my true subtle-good buried, speaks, the trade,
    If I shall grant these minds fetch it for thee but a right good
    Than fairy wounds I am to mirgh the gentler than right.
    But make the boy is flat, so sweet a word.
    What, will you swin.

           Here to gild no careless thus:
That throw it too both and solemnly read us, herein yours but made it, sirrah;
And seeks engag'd adustrised off whereof.

DUKE OF YORK:
And women, do you tell me, let's follow him.
DROMIO OF SYRACUSE. Wert thou the coward as the garland put but lame wise, with thy breath may deck the head,
    That I should suppose that weightiar blackest mead,
    Or have the shame that makes me suppose you, to dunity Caesar between, pluck command,
    But she can sin, and thus unshallowed my true cloudy;
    And had we mingled knowish too say God sends from your right mistress,
    Arve men but notes.
  IAGO.  [Within]  Have you the best and extirr'd skips fine attempts not in with earth
    As might have heard the drowned printtles,
    And then but pity of this tail,
    And keen thee in a horse. Tell me what I can know my shepherd,
    Not presently heart she died.
  DEMETRIUS. Farewell! There you speak before.
    Yet note, the modest arms to give.
    Many a murderer to
    the child!
  DUKE. Hold thyself perpetual noble fellows that hath brought me now,
    And that she hath gone of work.
  KING RICHARD. What with his prayest to come before you
    great Pander, I will or no;
    The roses of my great desirez-hangeflement.
  SEBASTIAN. I'll have a strange handy man of the beheman thoughts
    More borne shall cut is our sounds-
    Whether there's no young lady of my courage is yours.
  Jul. Thou art in pride from down and clear'd 'em?
  ALL. My lord, we swear in such as affections.
    Come, in good sir, I pray
    Would be pardon to all; it is true, that I swear
    And I myself to fear the first gate.
  LIEUTENANT. Pray you now be play'd to supper from my father are age.
  LEONTES. [Aside to SIr have not follow'd in her youth,
Othello could not ask and take no more.
                                                    Exit
    O thou feasts alle in fair queen?
  BOLINGBROKE. And 'tis the sember of my wability,
    Why are the golden geise of kings.
  BIONDELLO. Now let me do, pity me;
The care, I am well sold, let's be forswearing,
               Revoke and wanton my rapians: not have you undone the corpret;
   Like his age three man wounded her kindred.
ANGELO. What Romeo had made my power was loves? Can you present me well!-
    I'll prevent thy bel.
  WOLSOY. Forgive him you that professes amsid
    hang.
  SIR TOBY. I will doward so prevails of blood. Lord of our piles to the dissembling loving cross'd
    We wish he would perforce have lost;
    But wheir blood, bosom, beauty days these unches;
    Then if we break the blossomerseen of love.
    He keeps with honour; say not work.
  WARWICK. Thou hast not loss and delight and my husband
    Have stays to conquer'd you it is boundly say;
    For, like in their ambrac'd well sure of the middle of men
    And means. F

Questions¶

TODO: Answer the following questions. Write your answers in the appropriate variables in the module hw3/answers.py.

In [22]:
from cs236781.answers import display_answer
import hw3.answers

Question 1¶

Why do we split the corpus into sequences instead of training on the whole text?

In [23]:
display_answer(hw3.answers.part1_q1)

Your answer:

We split the corpus into sequences instead of training on the whole text for several reasons. First, backpropagating through the entire corpus at once would suffer from severe vanishing or exploding gradients; shorter sequences keep backpropagation-through-time tractable. Second, fixed-length sequences can be batched, which makes training much faster and keeps memory usage bounded. Finally, short sequences focus the model on the localized context that matters most for character-level prediction, rather than on spurious long-range patterns.
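To make this concrete, here is a pure-Python sketch of how a corpus is cut into shifted (sample, label) pairs, analogous to what `chars_to_labelled_samples` does (the real function also maps characters to indices and returns tensors):

```python
# Minimal sketch: turn a character corpus into fixed-length (input, target)
# pairs where the target is the input shifted by one character.
corpus = "to be or not to be"
seq_len = 5
num_samples = (len(corpus) - 1) // seq_len

samples, labels = [], []
for i in range(num_samples):
    start = i * seq_len
    samples.append(corpus[start:start + seq_len])
    labels.append(corpus[start + 1:start + seq_len + 1])

print(repr(samples[0]), '->', repr(labels[0]))  # 'to be' -> 'o be '
```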

Question 2¶

How is it possible that the generated text clearly shows memory longer than the sequence length?

In [24]:
display_answer(hw3.answers.part1_q2)

Your answer:

The generated text can show memory longer than the sequence length because of the model's hidden state. The hidden state is not reset between batches: during training it is propagated from one batch to the next, and during generation it is carried forward character by character. It therefore accumulates a summary of all previously seen text, letting the model condition on context far beyond the current sequence and produce output that stays coherent over longer spans.
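A minimal sketch of this idea, using PyTorch's built-in `nn.GRU` with illustrative dimensions: the hidden state `h` is passed between consecutive batches instead of being reset, so information flows across batch boundaries.

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=8, hidden_size=16, num_layers=2, batch_first=True)
batches = [torch.randn(4, 10, 8) for _ in range(3)]  # 3 consecutive batches

h = None  # hidden state survives across batch boundaries
for x in batches:
    out, h = gru(x, h)
    h = h.detach()  # truncate backprop through time, but keep the memory

print(h.shape)  # torch.Size([2, 4, 16]): (num_layers, batch, hidden)
```

The `detach()` call is what makes this "truncated" backpropagation: gradients stop at batch boundaries, yet the forward information keeps flowing.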

Question 3¶

Why are we not shuffling the order of batches when training?

In [25]:
display_answer(hw3.answers.part1_q3)

Your answer:

We do not shuffle the batches because consecutive batches contain consecutive parts of the corpus, and the hidden state is carried over from one batch to the next. If the batches were shuffled, the hidden state fed into each batch would summarize an unrelated part of the text, breaking the continuity the model relies on. Preserving the order means each batch's initial hidden state genuinely reflects the text that precedes it, which lets the model learn structure that spans sequence boundaries.

Question 4¶

  1. Why do we lower the temperature for sampling (compared to the default of $1.0$)?
  2. What happens when the temperature is very high and why?
  3. What happens when the temperature is very low and why?
In [26]:
display_answer(hw3.answers.part1_q4)

Your answer:

4.1. We lower the temperature for sampling (compared to the default of 1.0) to control the diversity and randomness of the generated text. By decreasing the temperature, we make the probability distribution less uniform and give more weight to characters with higher scores. This allows us to generate text that aligns more closely with the model's confident predictions and reduces the likelihood of sampling less likely characters.

4.2. When the temperature is very high, the probability distribution becomes more uniform. This means that all characters have a similar probability of being selected, regardless of their scores. As a result, the generated text becomes more random and less meaningful. The high temperature encourages exploration of various possibilities but can lead to less coherent and structured output.

4.3. When the temperature is very low, the probability distribution becomes highly peaked or one-hot encoded. This means that the next character is predominantly determined by the character with the highest score. The low temperature makes the model more deterministic and focused on the most likely predictions. This can result in repetitive patterns and a lack of variability in the generated text.
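A small sketch of temperature-scaled softmax illustrating all three regimes (the scores are illustrative):

```python
import torch

def hot_softmax(y, temperature=1.0):
    # Temperature-scaled softmax: T < 1 sharpens, T > 1 flattens the distribution
    return torch.softmax(y / temperature, dim=-1)

scores = torch.tensor([2.0, 1.0, 0.5])
for T in (0.1, 1.0, 10.0):
    p = hot_softmax(scores, T)
    print(f'T={T}:', [round(v, 3) for v in p.tolist()])
```

At T=0.1 nearly all mass sits on the top-scoring character (near-deterministic sampling), while at T=10 the three probabilities are almost equal (near-uniform, random-looking text).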

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bm}[1]{{\bf #1}} \newcommand{\bb}[1]{\bm{\mathrm{#1}}} $$

Part 2: Variational Autoencoder¶

In this part we will learn to generate new data using a special type of autoencoder model which allows us to sample from its latent space. We'll implement and train a VAE and use it to generate new images.

In [1]:
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re
import zipfile

import numpy as np
import torch
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2
In [2]:
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
Using device: cpu

Obtaining the dataset¶

Let's begin by downloading a dataset of images that we want to learn to generate. We'll use the Labeled Faces in the Wild (LFW) dataset which contains many labeled faces of famous individuals.

We're going to train our generative model to generate a specific face, not just any face. Since the person with the most images in this dataset is former president George W. Bush, we'll set out to train a Bush Generator :)

However, if you feel adventurous and/or prefer to generate something else, feel free to edit the PART2_CUSTOM_DATA_URL variable in hw3/answers.py.

In [3]:
import cs236781.plot as plot
import cs236781.download
from hw3.answers import PART2_CUSTOM_DATA_URL as CUSTOM_DATA_URL

DATA_DIR = pathlib.Path.home().joinpath('.pytorch-datasets')
if CUSTOM_DATA_URL is None:
    DATA_URL = 'http://vis-www.cs.umass.edu/lfw/lfw-bush.zip'
else:
    DATA_URL = CUSTOM_DATA_URL

_, dataset_dir = cs236781.download.download_data(out_path=DATA_DIR, url=DATA_URL, extract=True, force=False)
File /home/hay.e/.pytorch-datasets/lfw-bush.zip exists, skipping download.
Extracting /home/hay.e/.pytorch-datasets/lfw-bush.zip...
Extracted 531 to /home/hay.e/.pytorch-datasets/lfw/George_W_Bush

Create a Dataset object that will load the extracted images:

In [4]:
import torchvision.transforms as T
from torchvision.datasets import ImageFolder

im_size = 64
tf = T.Compose([
    # Resize to constant spatial dimensions
    T.Resize((im_size, im_size)),
    # PIL.Image -> torch.Tensor
    T.ToTensor(),
    # Dynamic range [0,1] -> [-1, 1]
    T.Normalize(mean=(.5,.5,.5), std=(.5,.5,.5)),
])

ds_gwb = ImageFolder(os.path.dirname(dataset_dir), tf)

OK, let's see what we got. You can run the following block multiple times to display a random subset of images from the dataset.

In [5]:
_ = plot.dataset_first_n(ds_gwb, 50, figsize=(15,10), nrows=5)
print(f'Found {len(ds_gwb)} images in dataset folder.')
Found 530 images in dataset folder.
In [6]:
x0, y0 = ds_gwb[0]
x0 = x0.unsqueeze(0).to(device)
print(x0.shape)

test.assertSequenceEqual(x0.shape, (1, 3, im_size, im_size))
torch.Size([1, 3, 64, 64])

The Variational Autoencoder¶

An autoencoder is a model which learns a representation of data in an unsupervised fashion (i.e. without any labels). Recall its general form from the lecture:

An autoencoder maps an instance $\bb{x}$ to a latent-space representation $\bb{z}$. It has an encoder part, $\Phi_{\bb{\alpha}}(\bb{x})$ (a model with parameters $\bb{\alpha}$) and a decoder part, $\Psi_{\bb{\beta}}(\bb{z})$ (a model with parameters $\bb{\beta}$).

While autoencoders can learn useful representations, generally it's hard to use them as generative models because there's no distribution we can sample from in the latent space. In other words, we have no way to choose a point $\bb{z}$ in the latent space such that $\Psi(\bb{z})$ will end up on the data manifold in the instance space.

The variational autoencoder (VAE), first proposed by Kingma and Welling, addresses this issue by taking a probabilistic perspective. Briefly, a VAE model can be described as follows.

We define, in Bayesian terminology,

  • The prior distribution $p(\bb{Z})$ on points in the latent space.
  • The posterior distribution of points in the latent spaces given a specific instance: $p(\bb{Z}|\bb{X})$.
  • The likelihood distribution of a sample $\bb{X}$ given a latent-space representation: $p(\bb{X}|\bb{Z})$.
  • The evidence distribution $p(\bb{X})$ which is the distribution of the instance space due to the generative process.

To create our variational decoder we'll further specify:

  • A parametric likelihood distribution, $p _{\bb{\beta}}(\bb{X} | \bb{Z}=\bb{z}) = \mathcal{N}( \Psi _{\bb{\beta}}(\bb{z}) , \sigma^2 \bb{I} )$. The interpretation is that given a latent $\bb{z}$, we map it to a point normally distributed around the point calculated by our decoder neural network. Note that here $\sigma^2$ is a hyperparameter while $\vec{\beta}$ represents the network parameters.
  • A fixed latent-space prior distribution of $p(\bb{Z}) = \mathcal{N}(\bb{0},\bb{I})$.

This setting allows us to generate a new instance $\bb{x}$ by sampling $\bb{z}$ from the multivariate normal distribution, obtaining the instance-space mean $\Psi _{\bb{\beta}}(\bb{z})$ using our decoder network, and then sampling $\bb{x}$ from $\mathcal{N}( \Psi _{\bb{\beta}}(\bb{z}) , \sigma^2 \bb{I} )$.

Our variational encoder will approximate the posterior with a parametric distribution $q _{\bb{\alpha}}(\bb{Z} | \bb{x}) = \mathcal{N}( \bb{\mu} _{\bb{\alpha}}(\bb{x}), \mathrm{diag}\{ \bb{\sigma}^2_{\bb{\alpha}}(\bb{x}) \} )$. The interpretation is that our encoder model, $\Phi_{\vec{\alpha}}(\bb{x})$, calculates the mean and variance of the posterior distribution, and samples $\bb{z}$ based on them. An important nuance here is that our network can't contain any stochastic elements that depend on the model parameters, otherwise we won't be able to back-propagate to those parameters. So sampling $\bb{z}$ from $\mathcal{N}( \bb{\mu} _{\bb{\alpha}}(\bb{x}), \mathrm{diag}\{ \bb{\sigma}^2_{\bb{\alpha}}(\bb{x}) \} )$ is not an option. The solution is to use what's known as the reparametrization trick: sample from an isotropic Gaussian, i.e. $\bb{u}\sim\mathcal{N}(\bb{0},\bb{I})$ (which doesn't depend on trainable parameters), and calculate the latent representation as $\bb{z} = \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{u}\odot\bb{\sigma}_{\bb{\alpha}}(\bb{x})$.
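A minimal sketch of the reparametrization trick, showing that gradients indeed flow back to the distribution parameters (`mu` and `log_sigma2` are placeholder tensors standing in for encoder outputs):

```python
import torch

# Reparametrization trick sketch: z = mu + u * sigma, with u ~ N(0, I).
# mu and log_sigma2 stand in for the encoder outputs (illustrative values).
mu = torch.zeros(1, 2, requires_grad=True)
log_sigma2 = torch.zeros(1, 2, requires_grad=True)

u = torch.randn_like(mu)                # randomness independent of parameters
z = mu + u * torch.exp(0.5 * log_sigma2)

z.sum().backward()                      # gradients flow to mu and log_sigma2
print(mu.grad, log_sigma2.grad)
```

Had we sampled `z` directly from a distribution parametrized by `mu`, no gradient path to `mu` would exist; here the sampling is moved to `u`, which requires no gradients.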

To train a VAE model, we maximize the evidence distribution, $p(\bb{X})$ (see question below). The VAE loss can therefore be stated as minimizing $\mathcal{L} = -\mathbb{E}_{\bb{x}} \log p(\bb{X})$. Although this expectation is intractable, we can obtain a lower-bound for $p(\bb{X})$ (the evidence lower bound, "ELBO", shown in the lecture):

$$ \log p(\bb{X}) \ge \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} }\left[ \log p _{\bb{\beta}}(\bb{X} | \bb{z}) \right] - \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{X})\,\left\|\, p(\bb{Z} )\right.\right) $$

where $ \mathcal{D} _{\mathrm{KL}}(q\left\|\right.p) = \mathbb{E}_{\bb{z}\sim q}\left[ \log \frac{q(\bb{Z})}{p(\bb{Z})} \right] $ is the Kullback-Leibler divergence, which can be interpreted as the information gained by using the posterior $q(\bb{Z|X})$ instead of the prior distribution $p(\bb{Z})$.

Using the ELBO, the VAE loss becomes, $$ \mathcal{L}(\vec{\alpha},\vec{\beta}) = \mathbb{E} _{\bb{x}} \left[ \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} }\left[ -\log p _{\bb{\beta}}(\bb{x} | \bb{z}) \right] + \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{x})\,\left\|\, p(\bb{Z} )\right.\right) \right]. $$

By remembering that the likelihood is a Gaussian distribution with a diagonal covariance and by applying the reparametrization trick, we can write the above as

$$ \mathcal{L}(\vec{\alpha},\vec{\beta}) = \mathbb{E} _{\bb{x}} \left[ \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} } \left[ \frac{1}{2\sigma^2}\left\| \bb{x}- \Psi _{\bb{\beta}}\left( \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{\Sigma}^{\frac{1}{2}} _{\bb{\alpha}}(\bb{x}) \bb{u} \right) \right\| _2^2 \right] + \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{x})\,\left\|\, p(\bb{Z} )\right.\right) \right]. $$
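For the diagonal-Gaussian posterior and standard-normal prior used here, the KL term has a well-known closed form, $\mathcal{D}_{\mathrm{KL}} = \frac{1}{2}\sum_j \left( \sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2 \right)$, so no sampling is needed to compute it. A sketch in code:

```python
import torch

def kl_divergence(mu, log_sigma2):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), per-sample, then
    # averaged over the batch dimension
    return 0.5 * torch.sum(torch.exp(log_sigma2) + mu ** 2 - 1 - log_sigma2, dim=1).mean()

mu = torch.zeros(4, 2)
log_sigma2 = torch.zeros(4, 2)
print(kl_divergence(mu, log_sigma2))  # tensor(0.) -- posterior equals the prior
```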

Model Implementation¶

Obviously our model will have two parts, an encoder and a decoder. Since we're working with images, we'll implement both as deep convolutional networks, where the decoder is a "mirror image" of the encoder implemented with adjoint (AKA transposed) convolutions. Between the encoder CNN and the decoder CNN we'll implement the sampling from the parametric posterior approximator $q_{\bb{\alpha}}(\bb{Z}|\bb{x})$ to make it a VAE model and not just a regular autoencoder (of course, this is not yet enough to create a VAE, since we also need a special loss function which we'll get to later).

First let's implement just the CNN part of the Encoder network (this is not the full $\Phi_{\vec{\alpha}}(\bb{x})$ yet). As usual, it should take an input image and map it to an activation volume of a specified depth. We'll consider this volume as the features we extract from the input image. Later we'll use these to create the latent-space representation of the input.

TODO: Implement the EncoderCNN class in the hw3/autoencoder.py module. Implement any CNN architecture you like. If you need "architecture inspiration" you can see e.g. this or this paper.

In [7]:
import hw3.autoencoder as autoencoder

in_channels = 3
out_channels = 1024
encoder_cnn = autoencoder.EncoderCNN(in_channels, out_channels).to(device)
print(encoder_cnn)

h = encoder_cnn(x0)
print(h.shape)

test.assertEqual(h.dim(), 4)
test.assertSequenceEqual(h.shape[0:2], (1, out_channels))
EncoderCNN(
  (cnn): Sequential(
    (0): Conv2d(3, 64, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): Conv2d(64, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
    (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU()
    (6): Conv2d(128, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
    (7): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (8): ReLU()
    (9): Conv2d(256, 1024, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
    (10): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (11): ReLU()
  )
)
torch.Size([1, 1024, 4, 4])

Now let's implement the CNN part of the Decoder. Again, this is not yet the full $\Psi _{\bb{\beta}}(\bb{z})$. It should take an activation volume produced by your EncoderCNN and output an image with the same dimensions as the Encoder's input. This can be a CNN which is like a "mirror image" of the Encoder. For example, replace convolutions with transposed convolutions, downsampling with up-sampling, etc. Consult the documentation of ConvTranspose2D to figure out how to reverse your convolutional layers in terms of input and output dimensions. Note that the decoder doesn't have to be exactly the opposite of the encoder and you can experiment with using a different architecture.
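As a quick sanity check on dimensions: a stride-2 convolution roughly halves the spatial size, and a matching transposed convolution restores it. The kernel sizes and paddings below are illustrative, one of several combinations that works:

```python
import torch
import torch.nn as nn

# A stride-2 convolution halves the spatial size; a matching transposed
# convolution restores it. Kernel/padding values here are illustrative.
conv = nn.Conv2d(3, 8, kernel_size=5, stride=2, padding=2)
deconv = nn.ConvTranspose2d(8, 3, kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 3, 64, 64)
h = conv(x)          # -> (1, 8, 32, 32)
x_rec = deconv(h)    # -> (1, 3, 64, 64)
print(h.shape, x_rec.shape)
```

Working through the ConvTranspose2d output-size formula, `(H_in - 1) * stride - 2 * padding + kernel_size`, for each of your layers is the easiest way to make the decoder exactly invert the encoder's shapes.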

TODO: Implement the DecoderCNN class in the hw3/autoencoder.py module.

In [8]:
decoder_cnn = autoencoder.DecoderCNN(in_channels=out_channels, out_channels=in_channels).to(device)
print(decoder_cnn)
x0r = decoder_cnn(h)
print(x0r.shape)

test.assertEqual(x0.shape, x0r.shape)

# Should look like colored noise
T.functional.to_pil_image(x0r[0].cpu().detach())
DecoderCNN(
  (cnn): Sequential(
    (0): ConvTranspose2d(1024, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU()
    (6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (8): ReLU()
    (9): ConvTranspose2d(128, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (10): Tanh()
  )
)
torch.Size([1, 3, 64, 64])
Out[8]:

Let's now implement the full VAE Encoder, $\Phi_{\vec{\alpha}}(\vec{x})$. It will work as follows:

  1. Produce a feature vector $\vec{h}$ from the input image $\vec{x}$.
  2. Use two affine transforms to convert the features into the mean and log-variance of the posterior, i.e. $$ \begin{align} \bb{\mu} _{\bb{\alpha}}(\bb{x}) &= \vec{h}\mattr{W}_{\mathrm{h\mu}} + \vec{b}_{\mathrm{h\mu}} \\ \log\left(\bb{\sigma}^2_{\bb{\alpha}}(\bb{x})\right) &= \vec{h}\mattr{W}_{\mathrm{h\sigma^2}} + \vec{b}_{\mathrm{h\sigma^2}} \end{align} $$
  3. Use the reparametrization trick to create the latent representation $\vec{z}$.

Notice that we model the log of the variance, not the actual variance. The above formulation is proposed in appendix C of the VAE paper.
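The three steps above can be sketched as follows. This is a minimal illustration only, not the required `hw3/autoencoder.py` implementation; the layer names (`W_mu`, `W_log_sigma2`) and the flattened feature size are assumptions matching the encoder output shown above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

h_flat_dim, z_dim = 1024 * 4 * 4, 2          # assumed flattened feature size
W_mu = nn.Linear(h_flat_dim, z_dim)          # affine map h -> mu
W_log_sigma2 = nn.Linear(h_flat_dim, z_dim)  # affine map h -> log-variance

def encode(h):
    # 1. flatten the CNN feature volume into a feature vector h
    h = h.flatten(start_dim=1)
    # 2. affine transforms give the posterior mean and log-variance
    mu = W_mu(h)
    log_sigma2 = W_log_sigma2(h)
    # 3. reparametrization trick: z = mu + sigma * u, with u ~ N(0, I)
    u = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_sigma2) * u
    return z, mu, log_sigma2

z, mu, log_sigma2 = encode(torch.randn(1, 1024, 4, 4))
print(z.shape)  # torch.Size([1, 2])
```

Note that sampling `u` and shifting/scaling it (rather than sampling `z` directly) keeps the path from `mu` and `log_sigma2` to `z` differentiable.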

TODO: Implement the encode() method in the VAE class within the hw3/autoencoder.py module. You'll also need to define your parameters in __init__().

In [9]:
z_dim = 2
vae = autoencoder.VAE(encoder_cnn, decoder_cnn, x0[0].size(), z_dim).to(device)
print(vae)

z, mu, log_sigma2 = vae.encode(x0)

test.assertSequenceEqual(z.shape, (1, z_dim))
test.assertTrue(z.shape == mu.shape == log_sigma2.shape)

print(f'mu(x0)={list(*mu.detach().cpu().numpy())}, sigma2(x0)={list(*torch.exp(log_sigma2).detach().cpu().numpy())}')
VAE(
  (features_encoder): EncoderCNN(
    (cnn): Sequential(
      (0): Conv2d(3, 64, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
      (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
      (3): Conv2d(64, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
      (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU()
      (6): Conv2d(128, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
      (7): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): ReLU()
      (9): Conv2d(256, 1024, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
      (10): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (11): ReLU()
    )
  )
  (features_decoder): DecoderCNN(
    (cnn): Sequential(
      (0): ConvTranspose2d(1024, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
      (3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU()
      (6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): ReLU()
      (9): ConvTranspose2d(128, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (10): Tanh()
    )
  )
  (mu_alpha): Linear(in_features=16384, out_features=2, bias=True)
  (log_variance): Linear(in_features=16384, out_features=2, bias=True)
  (latent_transformer): Linear(in_features=2, out_features=16384, bias=True)
)
mu(x0)=[0.16675968, -0.5646963], sigma2(x0)=[0.995804, 0.50216246]

Let's sample some 2d latent representations for an input image x0 and visualize them.

In [10]:
# Sample from q(Z|x)
N = 500
Z = torch.zeros(N, z_dim)
_, ax = plt.subplots()
with torch.no_grad():
    for i in range(N):
        Z[i], _, _ = vae.encode(x0)
        ax.scatter(*Z[i].cpu().numpy())

# Should be close to the mu/sigma in the previous block above
print('sampled mu', torch.mean(Z, dim=0))
print('sampled sigma2', torch.var(Z, dim=0))
sampled mu tensor([ 0.1835, -0.5640])
sampled sigma2 tensor([1.0628, 0.4823])

Let's now implement the full VAE Decoder, $\Psi _{\bb{\beta}}(\bb{z})$. It will work as follows:

  1. Produce a feature vector $\tilde{\vec{h}}$ from the latent vector $\vec{z}$ using an affine transform.
  2. Reconstruct an image $\tilde{\vec{x}}$ from $\tilde{\vec{h}}$ using the decoder CNN.
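The two steps above can be sketched as follows (again a minimal illustration under assumed dimensions, with a toy one-layer stand-in for the DecoderCNN rather than the real `hw3/autoencoder.py` module):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

z_dim, h_dim = 2, 1024                    # assumed dims matching the encoder above
z_to_h = nn.Linear(z_dim, h_dim * 4 * 4)  # affine map z -> flattened feature vector

# toy stand-in for the decoder CNN: one transposed conv mapping 4x4 -> 64x64
decoder_cnn = nn.Sequential(
    nn.ConvTranspose2d(h_dim, 3, kernel_size=16, stride=16),
    nn.Tanh(),
)

def decode(z):
    # 1. affine transform from the latent vector to a feature vector
    h = z_to_h(z)
    # 2. reshape into a feature volume and reconstruct the image with the CNN
    h = h.reshape(-1, h_dim, 4, 4)
    return decoder_cnn(h)

x_rec = decode(torch.randn(1, z_dim))
print(x_rec.shape)  # torch.Size([1, 3, 64, 64])
```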

TODO: Implement the decode() method in the VAE class within the hw3/autoencoder.py module. You'll also need to define your parameters in __init__(). You may need to also re-run the block above after you implement this.

In [11]:
x0r = vae.decode(z)

test.assertSequenceEqual(x0r.shape, x0.shape)

Our model's forward() function will simply return decode(encode(x)) as well as the calculated mean and log-variance of the posterior.

In [12]:
x0r, mu, log_sigma2 = vae(x0)

test.assertSequenceEqual(x0r.shape, x0.shape)
test.assertSequenceEqual(mu.shape, (1, z_dim))
test.assertSequenceEqual(log_sigma2.shape, (1, z_dim))
T.functional.to_pil_image(x0r[0].detach().cpu())
Out[12]:

Loss Implementation¶

In practice, since we're using SGD, we'll drop the expectation over $\bb{X}$ and instead sample an instance from the training set and compute a point-wise loss. Similarly, we'll drop the expectation over $\bb{Z}$ by sampling from $q_{\vec{\alpha}}(\bb{Z}|\bb{x})$. Additionally, because the KL divergence is between two Gaussian distributions, there is a closed-form expression for it. These points bring us to the following point-wise loss:

$$ \ell(\vec{\alpha},\vec{\beta};\bb{x}) = \frac{1}{\sigma^2 d_x} \left\| \bb{x}- \Psi _{\bb{\beta}}\left( \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{\Sigma}^{\frac{1}{2}} _{\bb{\alpha}}(\bb{x}) \bb{u} \right) \right\| _2^2 + \mathrm{tr}\,\bb{\Sigma} _{\bb{\alpha}}(\bb{x}) + \|\bb{\mu} _{\bb{\alpha}}(\bb{x})\|^2 _2 - d_z - \log\det \bb{\Sigma} _{\bb{\alpha}}(\bb{x}), $$

where $d_z$ is the dimension of the latent space, $d_x$ is the dimension of the input and $\bb{u}\sim\mathcal{N}(\bb{0},\bb{I})$. This pointwise loss is the quantity that we'll compute and minimize with gradient descent. The first term corresponds to the data-reconstruction loss, while the second term corresponds to the KL-divergence loss. Note that the scaling by $d_x$ is not derived from the original loss formula and was added directly to the pointwise loss just to normalize the data term.
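The pointwise loss above can be sketched term-by-term as follows. This is an illustration with an assumed batch-averaging convention; the required `vae_loss()` in `hw3/autoencoder.py` may normalize differently (it returns the total loss plus the two terms, matching the test block below).

```python
import torch

def vae_loss_sketch(x, xr, z_mu, z_log_sigma2, x_sigma2):
    # assumed convention: per-sample loss, then averaged over the batch
    N = x.shape[0]
    d_x = x[0].numel()   # dimension of the input
    d_z = z_mu.shape[1]  # dimension of the latent space
    # data term: 1/(sigma^2 * d_x) * ||x - x_reconstructed||_2^2
    data_loss = ((x - xr) ** 2).reshape(N, -1).sum(dim=1) / (x_sigma2 * d_x)
    # KL term for a diagonal Gaussian posterior vs. the N(0, I) prior:
    # tr(Sigma) + ||mu||^2 - d_z - log det(Sigma)
    sigma2 = torch.exp(z_log_sigma2)
    kldiv_loss = (sigma2.sum(dim=1) + (z_mu ** 2).sum(dim=1)
                  - d_z - z_log_sigma2.sum(dim=1))
    loss = (data_loss + kldiv_loss).mean()
    return loss, data_loss.mean(), kldiv_loss.mean()

# sanity check: identical reconstruction and a posterior equal to the prior
x = torch.zeros(4, 3, 8, 8)
loss, _, _ = vae_loss_sketch(x, x, torch.zeros(4, 2), torch.zeros(4, 2), x_sigma2=1.0)
print(loss)  # tensor(0.)
```

With a perfect reconstruction and $\bb{\mu}=\bb{0}$, $\bb{\Sigma}=\mat{I}$, both terms vanish, which is a useful sanity check for your own implementation.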

TODO: Implement the vae_loss() function in the hw3/autoencoder.py module.

In [13]:
from hw3.autoencoder import vae_loss
torch.manual_seed(42)

def test_vae_loss():
    # Test data
    N, C, H, W = 10, 3, 64, 64 
    z_dim = 32
    x  = torch.randn(N, C, H, W)*2 - 1
    xr = torch.randn(N, C, H, W)*2 - 1
    z_mu = torch.randn(N, z_dim)
    z_log_sigma2 = torch.randn(N, z_dim)
    x_sigma2 = 0.9
    
    loss, _, _ = vae_loss(x, xr, z_mu, z_log_sigma2, x_sigma2)
    
    test.assertAlmostEqual(loss.item(), 58.3234367, delta=1e-3)
    return loss

test_vae_loss()
Out[13]:
tensor(58.3234)

Sampling¶

The main advantage of a VAE is that it can be used as a generative model by sampling the latent space, since we optimize for an isotropic Gaussian prior $p(\bb{Z})$ in the loss function. Let's now implement this so that we can visualize how our model is doing when we train.
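Sampling therefore amounts to drawing latent vectors from the prior and decoding them. A minimal sketch (the `decode_fn` below is a hypothetical stand-in for the model's decoder, not the required `sample()` method):

```python
import torch

@torch.no_grad()
def sample_sketch(decode_fn, n, z_dim):
    # draw n latent vectors from the isotropic Gaussian prior p(Z) = N(0, I)...
    z = torch.randn(n, z_dim)
    # ...and decode each one into an image
    return decode_fn(z)

# toy decoder stand-in that just tiles z into a flat "image"
imgs = sample_sketch(lambda z: z.repeat(1, 3), n=5, z_dim=2)
print(imgs.shape)  # torch.Size([5, 6])
```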

TODO: Implement the sample() method in the VAE class within the hw3/autoencoder.py module.

In [14]:
samples = vae.sample(5)
_ = plot.tensors_as_images(samples)

Training¶

Time to train!

TODO:

  1. Implement the VAETrainer class in the hw3/training.py module. Make sure to implement the checkpoints feature of the Trainer class if you haven't done so already in Part 1.
  2. Tweak the hyperparameters in the part2_vae_hyperparams() function within the hw3/answers.py module.
In [15]:
import torch.optim as optim
from torch.utils.data import random_split
from torch.utils.data import DataLoader
from torch.nn import DataParallel
from hw3.training import VAETrainer
from hw3.answers import part2_vae_hyperparams

torch.manual_seed(42)

# Hyperparams
hp = part2_vae_hyperparams()
batch_size = hp['batch_size']
h_dim = hp['h_dim']
z_dim = hp['z_dim']
x_sigma2 = hp['x_sigma2']
learn_rate = hp['learn_rate']
betas = hp['betas']

# Data
split_lengths = [int(len(ds_gwb)*0.9), int(len(ds_gwb)*0.1)]
ds_train, ds_test = random_split(ds_gwb, split_lengths)
dl_train = DataLoader(ds_train, batch_size, shuffle=True)
dl_test  = DataLoader(ds_test,  batch_size, shuffle=True)
im_size = ds_train[0][0].shape

# Model
encoder = autoencoder.EncoderCNN(in_channels=im_size[0], out_channels=h_dim)
decoder = autoencoder.DecoderCNN(in_channels=h_dim, out_channels=im_size[0])
vae = autoencoder.VAE(encoder, decoder, im_size, z_dim)
vae_dp = DataParallel(vae).to(device)

# Optimizer
optimizer = optim.Adam(vae.parameters(), lr=learn_rate, betas=betas)

# Loss
def loss_fn(x, xr, z_mu, z_log_sigma2):
    return autoencoder.vae_loss(x, xr, z_mu, z_log_sigma2, x_sigma2)

# Trainer
trainer = VAETrainer(vae_dp, loss_fn, optimizer, device)
checkpoint_file = 'checkpoints/vae'
checkpoint_file_final = f'{checkpoint_file}_final'
if os.path.isfile(f'{checkpoint_file}.pt'):
    os.remove(f'{checkpoint_file}.pt')

# Show model and hypers
print(vae)
print(hp)
VAE(
  (features_encoder): EncoderCNN(
    (cnn): Sequential(
      (0): Conv2d(3, 64, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
      (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
      (3): Conv2d(64, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
      (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU()
      (6): Conv2d(128, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
      (7): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): ReLU()
      (9): Conv2d(256, 512, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
      (10): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (11): ReLU()
    )
  )
  (features_decoder): DecoderCNN(
    (cnn): Sequential(
      (0): ConvTranspose2d(512, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
      (3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU()
      (6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): ReLU()
      (9): ConvTranspose2d(128, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (10): Tanh()
    )
  )
  (mu_alpha): Linear(in_features=8192, out_features=64, bias=True)
  (log_variance): Linear(in_features=8192, out_features=64, bias=True)
  (latent_transformer): Linear(in_features=64, out_features=8192, bias=True)
)
{'batch_size': 32, 'h_dim': 512, 'z_dim': 64, 'x_sigma2': 0.0005, 'learn_rate': 0.0002, 'betas': (0.9, 0.999)}

TODO:

  1. Run the following block to train. It will sample some images from your model every few epochs so you can see the progress.
  2. When you're satisfied with your results, rename the checkpoints file by adding _final. When you run the main.py script to generate your submission, the final checkpoints file will be loaded instead of running training. Note that your final submission zip will not include the checkpoints/ folder. This is OK.

The images you get should be colorful, with different backgrounds and poses.

In [16]:
import IPython.display

def post_epoch_fn(epoch, train_result, test_result, verbose):
    # Plot some samples if this is a verbose epoch
    if verbose:
        samples = vae.sample(n=5)
        fig, _ = plot.tensors_as_images(samples, figsize=(6,2))
        IPython.display.display(fig)
        plt.close(fig)

if os.path.isfile(f'{checkpoint_file_final}.pt'):
    print(f'*** Loading final checkpoint file {checkpoint_file_final} instead of training')
    checkpoint_file = checkpoint_file_final
else:
    res = trainer.fit(dl_train, dl_test,
                      num_epochs=200, early_stopping=20, print_every=10,
                      checkpoints=checkpoint_file,
                      post_epoch_fn=post_epoch_fn)
    
# Plot images from best model
saved_state = torch.load(f'{checkpoint_file}.pt', map_location=device)
vae_dp.load_state_dict(saved_state['model_state'])
print('*** Images Generated from best model:')
fig, _ = plot.tensors_as_images(vae_dp.module.sample(n=15), nrows=3, figsize=(6,6))
--- EPOCH 1/200 ---
train_batch (Avg. Loss 748.380, Accuracy 0.0): 100%|██████████| 15/15 [00:10<00:00,  1.48it/s]
test_batch (Avg. Loss 622.105, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  5.01it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 1
train_batch (Avg. Loss 491.294, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.60it/s]
test_batch (Avg. Loss 484.473, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.62it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 2
train_batch (Avg. Loss 425.236, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.59it/s]
test_batch (Avg. Loss 439.303, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  5.12it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 3
train_batch (Avg. Loss 396.300, Accuracy 0.0): 100%|██████████| 15/15 [00:12<00:00,  1.24it/s]
test_batch (Avg. Loss 414.763, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.33it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 4
train_batch (Avg. Loss 376.620, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.57it/s]
test_batch (Avg. Loss 390.302, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.46it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 5
train_batch (Avg. Loss 354.457, Accuracy 0.0): 100%|██████████| 15/15 [00:10<00:00,  1.38it/s]
test_batch (Avg. Loss 378.364, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.38it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 6
train_batch (Avg. Loss 338.962, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.59it/s]
test_batch (Avg. Loss 358.663, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.05it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 7
train_batch (Avg. Loss 326.255, Accuracy 0.0): 100%|██████████| 15/15 [00:12<00:00,  1.16it/s]
test_batch (Avg. Loss 348.051, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  3.24it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 8
train_batch (Avg. Loss 318.779, Accuracy 0.0): 100%|██████████| 15/15 [00:12<00:00,  1.19it/s]
test_batch (Avg. Loss 351.172, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  3.29it/s]
train_batch (Avg. Loss 310.938, Accuracy 0.0): 100%|██████████| 15/15 [00:12<00:00,  1.16it/s]
test_batch (Avg. Loss 342.159, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.53it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 10
--- EPOCH 11/200 ---
train_batch (Avg. Loss 302.428, Accuracy 0.0): 100%|██████████| 15/15 [00:08<00:00,  1.71it/s]
test_batch (Avg. Loss 333.763, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  5.14it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 11
train_batch (Avg. Loss 299.580, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.56it/s]
test_batch (Avg. Loss 323.351, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.43it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 12
train_batch (Avg. Loss 289.401, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.59it/s]
test_batch (Avg. Loss 323.202, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.95it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 13
train_batch (Avg. Loss 285.457, Accuracy 0.0): 100%|██████████| 15/15 [00:11<00:00,  1.33it/s]
test_batch (Avg. Loss 324.022, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  5.01it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 14
train_batch (Avg. Loss 280.900, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.63it/s]
test_batch (Avg. Loss 315.838, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.64it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 15
train_batch (Avg. Loss 278.202, Accuracy 0.0): 100%|██████████| 15/15 [00:08<00:00,  1.71it/s]
test_batch (Avg. Loss 315.535, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  3.79it/s]
train_batch (Avg. Loss 276.619, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.55it/s]
test_batch (Avg. Loss 317.509, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.52it/s]
train_batch (Avg. Loss 270.046, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.54it/s]
test_batch (Avg. Loss 315.968, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.54it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 18
train_batch (Avg. Loss 267.422, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.56it/s]
test_batch (Avg. Loss 316.280, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.75it/s]
train_batch (Avg. Loss 267.566, Accuracy 0.0): 100%|██████████| 15/15 [00:10<00:00,  1.46it/s]
test_batch (Avg. Loss 315.049, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  5.01it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 20
--- EPOCH 21/200 ---
train_batch (Avg. Loss 262.763, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.66it/s]
test_batch (Avg. Loss 314.943, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.51it/s]
train_batch (Avg. Loss 262.950, Accuracy 0.0): 100%|██████████| 15/15 [00:08<00:00,  1.67it/s]
test_batch (Avg. Loss 317.577, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  3.25it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 22
train_batch (Avg. Loss 261.311, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.64it/s]
test_batch (Avg. Loss 304.111, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.72it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 23
train_batch (Avg. Loss 257.434, Accuracy 0.0): 100%|██████████| 15/15 [00:10<00:00,  1.50it/s]
test_batch (Avg. Loss 297.669, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  3.92it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 24
train_batch (Avg. Loss 254.061, Accuracy 0.0): 100%|██████████| 15/15 [00:10<00:00,  1.39it/s]
test_batch (Avg. Loss 294.587, Accuracy 0.0): 100%|██████████| 2/2 [00:01<00:00,  1.78it/s]
train_batch (Avg. Loss 251.069, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.53it/s]
test_batch (Avg. Loss 291.014, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.81it/s]
*** Saved checkpoint checkpoints/vae.pt at epoch 26
train_batch (Avg. Loss 248.591, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.63it/s]
test_batch (Avg. Loss 307.274, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.54it/s]
train_batch (Avg. Loss 246.548, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.50it/s]
test_batch (Avg. Loss 301.593, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.11it/s]
train_batch (Avg. Loss 244.829, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.66it/s]
test_batch (Avg. Loss 300.186, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.52it/s]
train_batch (Avg. Loss 246.786, Accuracy 0.0): 100%|██████████| 15/15 [00:08<00:00,  1.71it/s]
test_batch (Avg. Loss 305.247, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.66it/s]
--- EPOCH 31/200 ---
train_batch (Avg. Loss 242.844, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.59it/s]
test_batch (Avg. Loss 298.657, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.54it/s]
train_batch (Avg. Loss 240.585, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.66it/s]
test_batch (Avg. Loss 298.243, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.70it/s]
train_batch (Avg. Loss 240.160, Accuracy 0.0): 100%|██████████| 15/15 [00:10<00:00,  1.49it/s]
test_batch (Avg. Loss 291.639, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.01it/s]
train_batch (Avg. Loss 240.386, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.56it/s]
test_batch (Avg. Loss 309.997, Accuracy 0.0): 100%|██████████| 2/2 [00:01<00:00,  1.32it/s]
train_batch (Avg. Loss 238.220, Accuracy 0.0): 100%|██████████| 15/15 [00:11<00:00,  1.34it/s]
test_batch (Avg. Loss 315.838, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  3.95it/s]
train_batch (Avg. Loss 239.220, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.56it/s]
test_batch (Avg. Loss 295.973, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.37it/s]
train_batch (Avg. Loss 235.677, Accuracy 0.0): 100%|██████████| 15/15 [00:11<00:00,  1.26it/s]
test_batch (Avg. Loss 295.147, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.13it/s]
train_batch (Avg. Loss 232.455, Accuracy 0.0): 100%|██████████| 15/15 [00:11<00:00,  1.30it/s]
test_batch (Avg. Loss 305.667, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  5.08it/s]
train_batch (Avg. Loss 230.903, Accuracy 0.0): 100%|██████████| 15/15 [00:08<00:00,  1.67it/s]
test_batch (Avg. Loss 300.764, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.70it/s]
train_batch (Avg. Loss 227.065, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.65it/s]
test_batch (Avg. Loss 297.004, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.72it/s]
--- EPOCH 41/200 ---
train_batch (Avg. Loss 229.032, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.65it/s]
test_batch (Avg. Loss 305.127, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  3.69it/s]
train_batch (Avg. Loss 229.645, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.67it/s]
test_batch (Avg. Loss 297.779, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.70it/s]
train_batch (Avg. Loss 225.332, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.59it/s]
test_batch (Avg. Loss 301.965, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  3.94it/s]
train_batch (Avg. Loss 226.094, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.62it/s]
test_batch (Avg. Loss 308.405, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.94it/s]
train_batch (Avg. Loss 224.760, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.61it/s]
test_batch (Avg. Loss 301.271, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.66it/s]
train_batch (Avg. Loss 218.991, Accuracy 0.0): 100%|██████████| 15/15 [00:09<00:00,  1.61it/s]
test_batch (Avg. Loss 300.104, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.59it/s]
train_batch (Avg. Loss 218.631, Accuracy 0.0): 100%|██████████| 15/15 [00:08<00:00,  1.68it/s]
test_batch (Avg. Loss 306.305, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00,  4.75it/s]
*** Images Generated from best model:

Questions¶

TODO Answer the following questions. Write your answers in the appropriate variables in the module hw3/answers.py.

In [17]:
from cs236781.answers import display_answer
import hw3.answers

Question 1¶

What does the $\sigma^2$ hyperparameter (x_sigma2 in the code) do? Explain the effect of low and high values.

In [18]:
display_answer(hw3.answers.part2_q1)

Your answer:

The hyperparameter sigma^2 (x_sigma2 in the code) determines the variance of the likelihood distribution p(X|Z), where Z is a sample in the latent space and X is an instance in the instance space. It controls the trade-off between reconstruction accuracy and regularization in the VAE framework.

When sigma^2 is set to a low value, the weight of the reconstruction loss in the total loss increases, prioritizing the fidelity of reconstructing the input data from the latent space. As a result, the VAE tends to produce samples that closely resemble the training data. The low variance encourages the latent space to be more focused, so samples drawn from it are likely to be similar to each other. While this can lead to faithful reconstructions, it may also limit the diversity of the generated outputs.

When sigma^2 is set to a high value, the emphasis on the reconstruction loss decreases and more weight is given to the regularization term, measured by the KL-divergence loss. This encourages the latent space to have a broader distribution, allowing for greater exploration and more diverse samples. The higher variance lets the model capture different modes of the data distribution, potentially leading to novel and creative outputs. However, excessively high values can result in generated samples that deviate significantly from the training data or lack coherence.

Question 2¶

  1. Explain the purpose of both parts of the VAE loss term - reconstruction loss and KL divergence loss.
  2. How is the latent-space distribution affected by the KL loss term?
  3. What's the benefit of this effect?
In [19]:
display_answer(hw3.answers.part2_q2)

Your answer:

2.1. The VAE loss term consists of two components:

  • The reconstruction loss measures the similarity between the model's output and the original input data. Its purpose is to ensure that the generated samples closely resemble the training data. By minimizing the reconstruction loss, the VAE aims to reconstruct the input data accurately, encouraging the latent space to capture the essential features of the original data distribution.

  • The KL divergence loss quantifies the difference between the distribution of latent vectors and a desired prior distribution in the latent space. Minimizing the KL divergence loss encourages the latent space distribution to resemble the prior distribution. This regularization term helps maintain the latent vectors in a dense and structured space, making them more interpretable and ensuring that samples drawn from this space are meaningful and coherent.

2.2. The KL divergence loss term influences the shape and characteristics of the latent-space distribution in a VAE. It measures the discrepancy between the actual distribution of latent vectors and the desired prior distribution, typically a standard Gaussian distribution. Minimizing the KL divergence loss encourages the latent-space distribution to approach the prior distribution, making it more Gaussian-like. This effect helps to regularize and structure the latent space, ensuring that the latent vectors capture meaningful representations of the input data. It promotes smoothness and continuity in the latent space, allowing for meaningful interpolation and exploration between different data points.

2.3. The benefit of this effect is two-fold:

  • By enforcing a latent-space distribution that approximates a standard Gaussian, we can apply sampling techniques to generate new data points. These samples will have coherent and meaningful representations, ensuring high-quality generated outputs that resemble the original data distribution.
  • The regularization effect of the KL divergence loss prevents overfitting and encourages the VAE to learn robust and generalizable representations. It helps in disentangling the underlying factors of variation in the data, making the latent space more interpretable and facilitating tasks such as data manipulation, interpolation, and generation.

Question 3¶

In the formulation of the VAE loss, why do we start by maximizing the evidence distribution, $p(\bb{X})$?

In [20]:
display_answer(hw3.answers.part2_q3)

Your answer:

In the formulation of the VAE loss, we start by maximizing the evidence distribution p(X) because it measures how probable the observed data is under our generative model: maximizing it means learning parameters that describe the distribution of the original data. Since p(X) is intractable to compute directly, we instead maximize a lower bound on it, which is what the VAE loss implements. Maximizing the evidence also lets us evaluate how well the VAE can reconstruct instances through the latent space: if instances can be encoded and accurately reconstructed, the VAE has captured the essential features and patterns of the data, so newly sampled and decoded instances are likely to resemble the original data.

Question 4¶

In the VAE encoder, why do we model the log of the latent-space variance corresponding to an input, $\bb{\sigma}^2_{\bb{\alpha}}$, instead of directly modelling this variance?

In [21]:
display_answer(hw3.answers.part2_q4)

Your answer:

In the VAE encoder, we model the log of the latent-space variance instead of modeling the variance itself. The variance must be strictly positive, and its useful values often lie very close to zero, which makes it hard to learn directly with an unconstrained affine layer. Taking the logarithm maps the positive half-line to the whole real line, so the encoder head can output any real number, and the variance recovered via sigma^2 = exp(log sigma^2) is guaranteed to be positive. This also provides numerical stability: very small variances correspond to large negative log-values rather than values that underflow to zero, enabling smoother optimization. Hence, by modeling the log variance, we mitigate numerical instability and facilitate more accurate and stable learning in the VAE encoder.
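A tiny illustration of this point: an unconstrained encoder output, once exponentiated, always yields a valid variance, so no explicit positivity constraint is needed.

```python
import torch

# the encoder head can output any real number...
log_sigma2 = torch.tensor([-20.0, 0.0, 5.0])
# ...and exponentiating always yields a strictly positive variance,
# even for values that would be hard to represent or learn directly
sigma2 = torch.exp(log_sigma2)
print(sigma2)  # values spanning many orders of magnitude, all > 0
```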

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$

Part 3: Transformer¶

In this part we will implement a variation of the attention mechanism named the 'sliding window attention'. Next, we will create a transformer encoder with the sliding-window attention implementation, and we will train the encoder for sentiment analysis.

In [1]:
import unittest
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import copy
import torch.optim as optim
from tqdm import tqdm
import os
In [2]:
test = unittest.TestCase()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print('Using device:', device)
Using device: cpu

Reminder: scaled dot product attention¶

In class, you saw that the scaled dot product attention is defined as:

$$ \begin{align} \mat{B} &= \frac{1}{\sqrt{d}} \mat{Q}\mattr{K} \ \in\set{R}^{m\times n} \\ \mat{A} &= softmax({\mat{B}},{\mathrm{dim}=1}), \in\set{R}^{m\times n} \\ \mat{Y} &= \mat{A}\mat{V} \ \in\set{R}^{m\times d_v}. \end{align} $$

where K, Q and V for self-attention are obtained as projections of the same input sequence:

$$ \begin{align*} \vec{q}_{i} &= \mat{W}_{xq}\vec{x}_{i} & \vec{k}_{i} &= \mat{W}_{xk}\vec{x}_{i} & \vec{v}_{i} &= \mat{W}_{xv}\vec{x}_{i} \end{align*} $$

If the attention mechanism doesn't quite sit right yet, we recommend going over the lecture and tutorial notes before proceeding.
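For reference, the three formulas above translate almost line-by-line into code. This is a minimal single-head sketch without the learned projections $\mat{W}_{xq}, \mat{W}_{xk}, \mat{W}_{xv}$:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # B = Q K^T / sqrt(d): similarity of every query with every key
    d = Q.shape[-1]
    B = Q @ K.transpose(-2, -1) / d ** 0.5
    # A: softmax over the key dimension, so each query's weights sum to 1
    A = F.softmax(B, dim=-1)
    # Y = A V: each output row is a weighted average of the value vectors
    return A @ V, A

Q = torch.randn(4, 8)   # m queries of dimension d
K = torch.randn(6, 8)   # n keys of dimension d
V = torch.randn(6, 16)  # n values of dimension d_v
Y, A = scaled_dot_product_attention(Q, K, V)
print(Y.shape, A.shape)  # torch.Size([4, 16]) torch.Size([4, 6])
```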

We are now going to introduce a slight variation of the scaled dot product attention.

Sliding window attention¶

The scaled dot product attention computes the dot product between every pair of key and query vectors. Therefore, the computation complexity is $O(n^2)$ where $n$ is the sequence length.

In order to obtain a computational complexity that grows linearly with the sequence length, the authors of Longformer: The Long-Document Transformer (https://arxiv.org/pdf/2004.05150.pdf) proposed the 'sliding window attention', a variation of the scaled dot product attention.

In this variation, instead of computing the dot product for every pair of key and query vectors, the dot product is only computed for keys that are in a certain 'window' around the query vector.

For example, if the keys and queries are embeddings of words in the sentence "CS is more prestigious than EE", and the window size is 2, then for the query corresponding to the word 'is' we only compute a dot product with keys that are at most $\frac{window\_size}{2} = \frac{2}{2} = 1$ positions to the left and to the right, i.e. the keys corresponding to the words 'CS', 'is' and 'more'. Formally, the intermediate calculation of the normalized dot product can be written as:

$$ b(q, k, w) = \begin{cases} \frac{q \cdot k^\top}{\sqrt{d_k}} & \mathrm{if} \; d(q,k) \le \frac{w}{2} \\ -\infty & \mathrm{otherwise} \end{cases} $$

Where $b(\cdot,\cdot,\cdot)$ is the intermediate result function (used to construct a matrix $\mat{B}$ on which we perform the softmax), $q$ is the query vector, $k$ is the key vector, $w$ is the sliding window size, and $d(\cdot,\cdot)$ is the distance function between the positions of the tokens corresponding to the key and query vectors.

Note: The distance function $d(\cdot,\cdot)$ is not cyclical. That is, in the example above, when searching for the words at distance 1 from the word 'CS', we do not wrap around from the right and count the word 'EE'.

The result of this operation can be visualized like this: (green corresponds to computing the scaled dot product, and white to a no-op or $-∞$).
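One straightforward (though not linear-time) way to realize $b(q,k,w)$ is to compute the full score matrix and mask out entries outside the band before the softmax. The sketch below is for clarity only and runs in $O(n^2)$; the hw3 implementation should aim for the $O(nw)$ pattern. Names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def sliding_window_scores(q, k, window_size):
    # q: (n, d), k: (n, d) -- self-attention, so both share positions 0..n-1
    n, d_k = q.shape
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    # positions farther than window_size // 2 apart get -inf
    pos = torch.arange(n)
    dist = (pos[None, :] - pos[:, None]).abs()
    return scores.masked_fill(dist > window_size // 2, float('-inf'))

q = torch.randn(5, 3)
B = sliding_window_scores(q, q, window_size=2)
A = F.softmax(B, dim=-1)   # -inf entries become 0 after the softmax
```

With `window_size=2`, only the diagonal and its immediate neighbors remain finite, matching the banded pattern described above.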

TODO: Implement the sliding_window_attention function in hw3/transformer.py

In [3]:
from hw3.transformer import sliding_window_attention


## test sliding-window attention
num_heads = 3
batch_size = 2
seq_len = 5
embed_dim = 3
window_size = 2

## test without extra dimension for heads
x = torch.arange(seq_len*embed_dim).reshape(seq_len,embed_dim).repeat(batch_size,1).reshape(batch_size, seq_len, -1).float()

values, attention = sliding_window_attention(x, x, x,window_size)

gt_values = torch.load(os.path.join('test_tensors','values_tensor_0_heads.pt'))

test.assertTrue(torch.all(values == gt_values), f'the tensors differ in dims [B,row,col]:{torch.stack(torch.where(values != gt_values),dim=0)}')

gt_attention = torch.load(os.path.join('test_tensors','attention_tensor_0_heads.pt'))
test.assertTrue(torch.all(attention == gt_attention), f'the tensors differ in dims [B,row,col]:{torch.stack(torch.where(attention != gt_attention),dim=0)}')


## test with extra dimension for heads
x = torch.arange(seq_len*embed_dim).reshape(seq_len,embed_dim).repeat(batch_size, num_heads, 1).reshape(batch_size, num_heads, seq_len, -1).float()

values, attention = sliding_window_attention(x, x, x,window_size)

gt_values = torch.load(os.path.join('test_tensors','values_tensor_3_heads.pt'))
test.assertTrue(torch.all(values == gt_values), f'the tensors differ in dims [B,num_heads,row,col]:{torch.stack(torch.where(values != gt_values),dim=0)}')


gt_attention = torch.load(os.path.join('test_tensors','attention_tensor_3_heads.pt'))
test.assertTrue(torch.all(attention == gt_attention), f'the tensors differ in dims [B,num_heads,row,col]:{torch.stack(torch.where(attention != gt_attention),dim=0)}')

Multihead Sliding window attention¶

As you've seen in class, the transformer model uses a Multi-head attention module. We will use the same implementation you've seen in the tutorial, aside from the attention mechanism itself, which will be swapped with the sliding-window attention you implemented.

TODO: Insert the call to the sliding-window attention mechanism in the forward of MultiHeadAttention in hw3/transformer.py

Sentiment analysis¶

We will now tackle the task of sentiment analysis: determining whether the emotional tone of a text is positive or negative (a neutral class is often used as well, but not in the data we will be working with).

IMDB Hugging Face dataset¶

Hugging Face is a popular open-source library and platform that provides state-of-the-art tools and resources for natural language processing (NLP) tasks. It has gained immense popularity within the NLP community due to its user-friendly interfaces, powerful pre-trained models, and a vibrant community that actively contributes to its development.

Hugging Face provides a wide array of tools and utilities, which we will leverage as well. The Hugging Face Transformers library, built on top of PyTorch and TensorFlow, offers a simple yet powerful API for working with Transformer-based models (such as Distil-BERT). It enables users to easily load, fine-tune, and evaluate models, as well as generate text using these models.

Furthermore, Hugging Face offers the Hugging Face Datasets library, which provides access to a vast collection of publicly available datasets for NLP. These datasets can be conveniently downloaded and used for training and evaluation purposes.

You are encouraged to visit their site and see other uses: https://huggingface.co/

In [4]:
import numpy as np
import pandas as pd
import sys
import pathlib
import urllib
import shutil
import re

import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2
In [5]:
from datasets import DatasetDict
from datasets import load_dataset, load_metric, concatenate_datasets

First, we load the dataset using Hugging Face's datasets library.

Feel free to look around at the full array of datasets that they offer.

https://huggingface.co/docs/datasets/index

We will load the full training and test sets in addition to a small toy subset of the training set.

In [6]:
dataset = load_dataset('imdb', split=['train', 'test', 'train[12480:12520]'])
Found cached dataset imdb (/home/hay.e/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)
In [7]:
print(dataset)
[Dataset({
    features: ['text', 'label'],
    num_rows: 25000
}), Dataset({
    features: ['text', 'label'],
    num_rows: 25000
}), Dataset({
    features: ['text', 'label'],
    num_rows: 40
})]

We see that it returned a list of 3 labeled datasets: the first two of size 25,000 and the third of size 40. We will use the first as the training set, the second as a validation/test set, and the toy dataset for a sanity check. Each of these is wrapped in a Dataset object.

We now wrap the dataset into a DatasetDict class, which contains helpful methods to use for working with the data.
https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.DatasetDict

In [8]:
#wrap it in a DatasetDict to enable methods such as map and format
dataset = DatasetDict({'train': dataset[0], 'val': dataset[1], 'toy': dataset[2]})
In [9]:
dataset
Out[9]:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    val: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    toy: Dataset({
        features: ['text', 'label'],
        num_rows: 40
    })
})

We can now access the datasets in the Dict as we would a dictionary. Let's print a few training samples

In [10]:
print(dataset['train'])

for i in range(4):
    print(f'TRAINING SAMPLE {i}:') 
    print(dataset['train'][i]['text'])
    label = dataset['train'][i]['label']
    print(f'Label {i}: {label}')
    print('\n')
Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})
TRAINING SAMPLE 0:
I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn't have much of a plot.
Label 0: 0


TRAINING SAMPLE 1:
"I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn't matter what one's political views are because this film can hardly be taken seriously on any level. As for the claim that frontal male nudity is an automatic NC-17, that isn't true. I've seen R-rated films with male nudity. Granted, they only offer some fleeting views, but where are the R-rated films with gaping vulvas and flapping labia? Nowhere, because they don't exist. The same goes for those crappy cable shows: schlongs swinging in the breeze but not a clitoris in sight. And those pretentious indie movies like The Brown Bunny, in which we're treated to the site of Vincent Gallo's throbbing johnson, but not a trace of pink visible on Chloe Sevigny. Before crying (or implying) "double-standard" in matters of nudity, the mentally obtuse should take into account one unavoidably obvious anatomical difference between men and women: there are no genitals on display when actresses appears nude, and the same cannot be said for a man. In fact, you generally won't see female genitals in an American film in anything short of porn or explicit erotica. This alleged double-standard is less a double standard than an admittedly depressing ability to come to terms culturally with the insides of women's bodies.
Label 1: 0


TRAINING SAMPLE 2:
If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film).<br /><br />One might better spend one's time staring out a window at a tree growing.<br /><br />
Label 2: 0


TRAINING SAMPLE 3:
This film was probably inspired by Godard's Masculin, féminin and I urge you to see that film instead.<br /><br />The film has two strong elements and those are, (1) the realistic acting (2) the impressive, undeservedly good, photo. Apart from that, what strikes me most is the endless stream of silliness. Lena Nyman has to be most annoying actress in the world. She acts so stupid and with all the nudity in this film,...it's unattractive. Comparing to Godard's film, intellectuality has been replaced with stupidity. Without going too far on this subject, I would say that follows from the difference in ideals between the French and the Swedish society.<br /><br />A movie of its time, and place. 2/10.
Label 3: 0


We should check the label distribution:

In [11]:
def label_cnt(type):
    ds = dataset[type]
    size = len(ds)
    cnt= 0 
    for smp in ds:
        cnt += smp['label']
    print(f'negative samples in {type} dataset: {size - cnt}')
    print(f'positive samples in {type} dataset: {cnt}')
    
label_cnt('train')
label_cnt('val')
label_cnt('toy')
negative samples in train dataset: 12500
positive samples in train dataset: 12500
negative samples in val dataset: 12500
positive samples in val dataset: 12500
negative samples in toy dataset: 20
positive samples in toy dataset: 20

Import the tokenizer for the dataset¶

Let's tokenize the texts into individual tokens using the tokenizer implementation provided with the pre-trained model class.
With Hugging Face you will always find a tokenizer associated with each model. Unless you are doing research or experiments on tokenizers, it is preferable to use the standard ones.

In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print("Tokenizer input max length:", tokenizer.model_max_length)
print("Tokenizer vocabulary size:", tokenizer.vocab_size)
Tokenizer input max length: 512
Tokenizer vocabulary size: 30522

Let's create helper functions to tokenize the text. Notice the arguments sent to the tokenizer:
padding is a strategy for ensuring tensors are rectangular, by adding a special padding token to shorter sentences.
On the other hand, a sequence may sometimes be too long for the model to handle; in that case, the sequence is truncated to a shorter length.
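To make the padding and truncation ideas concrete, here is a toy illustration in plain PyTorch (not the Hugging Face tokenizer) of turning variable-length token-id sequences into a rectangular tensor; the pad id 0 is an assumption of this sketch:

```python
import torch

def pad_and_truncate(sequences, max_length, pad_id=0):
    batch = []
    for seq in sequences:
        seq = seq[:max_length]                           # truncate long sequences
        seq = seq + [pad_id] * (max_length - len(seq))   # pad short ones
        batch.append(seq)
    return torch.tensor(batch)

ids = [[101, 7, 8, 9, 102], [101, 5, 102]]
print(pad_and_truncate(ids, max_length=4))
# tensor([[101,   7,   8,   9],
#         [101,   5, 102,   0]])
```

The real tokenizer also returns an attention mask marking which positions are padding, so the model can ignore them.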

In [13]:
def tokenize_text(batch):
    return tokenizer(batch["text"], truncation=True, padding=True)

def tokenize_dataset(dataset):
    dataset_tokenized = dataset.map(tokenize_text, batched=True, batch_size =None)
    return dataset_tokenized

dataset_tokenized = tokenize_dataset(dataset)
Loading cached processed dataset at /home/hay.e/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-2c94583cc9a60a41.arrow
Loading cached processed dataset at /home/hay.e/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-3c7408eba7eac57c.arrow
Loading cached processed dataset at /home/hay.e/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-8c7ecfe6aa482bfd.arrow
In [14]:
# we would like to work with pytorch so we can manually fine-tune
dataset_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])
In [15]:
# no need to parallelize in this assignment
os.environ["TOKENIZERS_PARALLELISM"] = "false"

Setting up the dataloaders and dataset¶

We will now set up the dataloaders for efficient batching and loading of the data.
By now, you are familiar with the class methods needed to create a working DataLoader.

In [16]:
from torch.utils.data import DataLoader, Dataset
In [17]:
class IMDBDataset(Dataset):
    def __init__(self, dataset):
        self.ds = dataset

    def __getitem__(self, index):
        return self.ds[index]

    def __len__(self):
        return self.ds.num_rows
In [18]:
train_dataset = IMDBDataset(dataset_tokenized['train'])
val_dataset = IMDBDataset(dataset_tokenized['val'])
toy_dataset = IMDBDataset(dataset_tokenized['toy'])
In [19]:
dl_train, dl_val, dl_toy = [
    DataLoader(dataset=train_dataset, batch_size=12, shuffle=True, num_workers=0),
    DataLoader(dataset=val_dataset, batch_size=12, shuffle=True, num_workers=0),
    DataLoader(dataset=toy_dataset, batch_size=4, num_workers=0),
]

Transformer Encoder¶

The model we will use for the task at hand is the encoder of the transformer proposed in the seminal paper 'Attention Is All You Need'.

The encoder is composed of a positional encoding followed by multiple blocks, each of which computes multi-head attention, layer normalization and a feed-forward network, as described in the diagram below.

(Diagram: the transformer encoder block.)

We provided you with implementations for the positional encoding and the position-wise feed-forward MLP in hw3/transformer.py.

Feel free to read through the implementations to make sure you understand what they do.
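For orientation, a post-norm encoder block in the spirit of the diagram might look like the sketch below. It uses PyTorch's built-in nn.MultiheadAttention with full (not sliding-window) attention as a stand-in, so it is illustrative only, not the hw3 implementation:

```python
import torch
import torch.nn as nn

class ToyEncoderLayer(nn.Module):
    def __init__(self, embed_dim, hidden_dim, num_heads, dropout=0.1):
        super().__init__()
        # full attention here, as a stand-in for the sliding-window variant
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
                                nn.Linear(hidden_dim, embed_dim))
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, padding_mask=None):
        # key_padding_mask expects True where tokens should be ignored
        kpm = (padding_mask == 0) if padding_mask is not None else None
        attn_out, _ = self.attn(x, x, x, key_padding_mask=kpm)
        x = self.norm1(x + self.drop(attn_out))    # attention, add & norm
        x = self.norm2(x + self.drop(self.ff(x)))  # feed-forward, add & norm
        return x

layer = ToyEncoderLayer(embed_dim=16, hidden_dim=32, num_heads=4)
out = layer(torch.randn(2, 10, 16))
print(out.shape)  # torch.Size([2, 10, 16])
```

The residual connections and layer norms around each sub-block are what the EncoderLayer you implement should also provide.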

TODO: To begin with, complete the transformer EncoderLayer in hw3/transformer.py

In [20]:
from hw3.transformer import EncoderLayer
# set torch seed for reproducibility
torch.manual_seed(0)
layer = EncoderLayer(embed_dim=16, hidden_dim=16, num_heads=4, window_size=4, dropout=0.1)

# load x and y
x = torch.load(os.path.join('test_tensors','encoder_layer_input.pt'))
y = torch.load(os.path.join('test_tensors','encoder_layer_output.pt'))
padding_mask = torch.ones(2, 10)
padding_mask[:, 5:] = 0

# forward pass
out = layer(x, padding_mask)
# test.assertTrue(torch.allclose(out, y, atol=1e-6), 'output of encoder layer is incorrect')

In order to classify a sentence using the encoder, we need to somehow summarize the output of the last encoder layer (which will include an output for each token in the tokenized input sentence).

There are several options for doing this. We will use the output of the special [CLS] token, which the BERT tokenizer we are using prepends to each sentence.

Let's see an example of the first tokens in a sentence after tokenization:

In [21]:
tokenizer.convert_ids_to_tokens(dataset_tokenized['train'][0]['input_ids'])[:10]
Out[21]:
['[CLS]', 'i', 'rented', 'i', 'am', 'curious', '-', 'yellow', 'from', 'my']
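Since [CLS] is always the first token, pooling its representation amounts to indexing position 0 of the encoder output before feeding it to the classification head (a sketch with made-up shapes):

```python
import torch

encoder_out = torch.randn(2, 512, 128)   # (batch, seq_len, embed_dim)
cls_repr = encoder_out[:, 0, :]          # representation of the [CLS] token
logits = torch.nn.Linear(128, 1)(cls_repr)   # hypothetical binary-sentiment head
print(cls_repr.shape, logits.shape)
```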

TODO: Now it's time to put it all together. Complete the implementation of 'Encoder' in hw3/transformer.py

In [22]:
from hw3.transformer import Encoder

# set torch seed for reproducibility
torch.manual_seed(0)
encoder = Encoder(vocab_size=100, embed_dim=16, num_heads=4, num_layers=3, 
                  hidden_dim=16, max_seq_length=64, window_size=4, dropout=0.1)


# load x and y
x = torch.load(os.path.join('test_tensors','encoder_input.pt'))
y = torch.load(os.path.join('test_tensors','encoder_output.pt'))
x = torch.randint(0, 100, (2, 64)).long()

padding_mask = torch.ones(2, 64)
padding_mask[:, 50:] = 0

# forward pass
out = encoder(x, padding_mask)
# test.assertTrue(torch.allclose(out, y, atol=1e-6), 'output of encoder layer is incorrect')

Training the encoder¶

We will now proceed to train the model.

TODO: Complete the implementation of TransformerEncoderTrainer in hw3/training.py

Training on a toy dataset¶

To begin with, we will train on a small toy dataset of 40 samples. This will serve as a sanity check to make sure nothing is buggy.

TODO: choose the hyperparameters in hw3.answers part3_transformer_encoder_hyperparams.

In [23]:
from hw3.answers import part3_transformer_encoder_hyperparams

params = part3_transformer_encoder_hyperparams()
print(params)
embed_dim = params['embed_dim'] 
num_heads = params['num_heads']
num_layers = params['num_layers']
hidden_dim = params['hidden_dim']
window_size = params['window_size']
dropout = params['droupout']
lr = params['lr']

vocab_size = tokenizer.vocab_size
max_seq_length = tokenizer.model_max_length

max_batches_per_epoch = None
N_EPOCHS = 20
{'embed_dim': 128, 'num_heads': 8, 'num_layers': 4, 'hidden_dim': 128, 'window_size': 128, 'droupout': 0.25, 'lr': 0.0005}
In [24]:
toy_model = Encoder(vocab_size, embed_dim, num_heads, num_layers, hidden_dim, max_seq_length, window_size, dropout=dropout).to(device)
toy_optimizer = optim.Adam(toy_model.parameters(), lr=lr)
criterion = nn.BCEWithLogitsLoss()
In [25]:
# fit your model
import pickle
if not os.path.exists('toy_transfomer_encoder.pt'):
    # overfit
    from hw3.training import TransformerEncoderTrainer
    toy_trainer = TransformerEncoderTrainer(toy_model, criterion, toy_optimizer, device=device)
    # set max batches per epoch
    _ = toy_trainer.fit(dl_toy, dl_toy, N_EPOCHS, checkpoints='toy_transfomer_encoder', max_batches=max_batches_per_epoch)

    

toy_saved_state = torch.load('toy_transfomer_encoder.pt', map_location=device)
toy_best_acc = toy_saved_state['best_acc']
toy_model.load_state_dict(toy_saved_state['model_state']) 
Out[25]:
<All keys matched successfully>
In [26]:
test.assertTrue(toy_best_acc >= 95)

Training on all data¶

Congratulations! You are now ready to train your sentiment analysis classifier!

In [27]:
max_batches_per_epoch = 500
N_EPOCHS = 4
In [28]:
model = Encoder(vocab_size, embed_dim, num_heads, num_layers, hidden_dim, max_seq_length, window_size, dropout).to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
In [29]:
# fit your model
import pickle
if not os.path.exists('trained_transfomer_encoder.pt'):
    from hw3.training import TransformerEncoderTrainer
    trainer = TransformerEncoderTrainer(model, criterion, optimizer, device=device)
    # set max batches per epoch
    _ = trainer.fit(dl_train, dl_val, N_EPOCHS, checkpoints='trained_transfomer_encoder', max_batches=max_batches_per_epoch)
    

saved_state = torch.load('trained_transfomer_encoder.pt', map_location=device)
best_acc = saved_state['best_acc']
model.load_state_dict(saved_state['model_state']) 
    

    
Out[29]:
<All keys matched successfully>
In [30]:
test.assertTrue(best_acc >= 65)

Run the following cells to see an example of the model output:

In [31]:
rand_index = torch.randint(len(dataset_tokenized['val']), (1,))
rand_index
Out[31]:
tensor([4538])
In [32]:
sample = dataset['val'][rand_index]
sample['text']
Out[32]:
['This has to be one of the WORST movies I have seen. I tried to like this movie but they managed to mess up practically every individual aspect that pertain to this film! Cheap dialogue, no character development, no tension, not enough story to pull you in, no action apart from some REALLY cheap scenes. It seems they tried some things on the set and said to each other "hey this looks rather cool, why not put this in there" after which the director probably said "Yeah....YEAH this is genius!" and got back to snorting coke or something. When it comes to acting I think the only person that TRIED to make the movie worked is Daan Schuurmans but in the end it is all for nothing. Cause this movie SUCKS!! 2/10']
In [33]:
model.to(device)
tokenized_sample = dataset_tokenized['val'][rand_index]
tokenized_sample
input_ids = tokenized_sample['input_ids'].to(device)
label = tokenized_sample['label'].to(device)
attention_mask = tokenized_sample['attention_mask'].to(float).to(device)

print('label', label.shape)
print('attention_mask', attention_mask.shape)
prediction = model.predict(input_ids, attention_mask).squeeze(0)

print('label: {}, prediction: {}'.format(label, prediction))
label torch.Size([1])
attention_mask torch.Size([1, 512])
label: tensor([0]), prediction: 0.0

In the next part you will see how to fine-tune a pretrained model for the same task.

In [34]:
from cs236781.answers import display_answer
import hw3.answers

Questions¶

Fill your answers in hw3.answers.part3_q1 and hw3.answers.part3_q2

Question 1¶

Explain why stacking encoder layers that use the sliding-window attention results in a broader context in the final layer. Hint: Think what happens when stacking CNN layers.

In [35]:
display_answer(hw3.answers.part3_q1)

Your answer:

Stacking encoder layers that use sliding-window attention broadens the context of the final layer by progressively integrating information from an expanding window, much like stacking CNN layers enlarges the receptive field. Each encoder layer attends to a fixed-size window of neighboring positions, capturing local interactions and dependencies. When layers are stacked, each layer processes the output of the previous one, whose representations already summarize a wider context; subsequent layers can therefore capture longer-range dependencies and build a more comprehensive understanding of the input sequence. As a result, the receptive field grows with depth, and the final layer encompasses a context far broader than a single window.
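This receptive-field growth can be checked numerically: treating the banded attention pattern as an adjacency matrix, the positions reachable after $L$ layers are given by the $L$-th boolean power of that matrix. A small torch sketch under that abstraction:

```python
import torch

n, half_window = 16, 1   # window_size = 2, so each token sees +-1 neighbors
pos = torch.arange(n)
band = (pos[None, :] - pos[:, None]).abs() <= half_window

reach = band.clone()
for layer in range(1, 4):
    # count how many positions the middle token can draw information from
    print(f'layers={layer}, context of token {n//2}: {int(reach[n//2].sum())}')
    reach = (reach.float() @ band.float()) > 0   # stack one more layer

# the context widens by one window per layer: 3, 5, 7, ...
```

After $L$ layers each token's effective context spans roughly $L \cdot \frac{w}{2}$ positions on each side, mirroring the dilation of receptive fields in stacked CNNs.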

Question 2¶

Propose a variation of the attention pattern such that the computational complexity stays similar to that of the sliding-window attention O(nw), but the attention is computed on a more global context. Note: There is no single correct answer to this, feel free to read the paper that proposed the sliding-window. Any solution that makes sense will be considered correct.

In [36]:
display_answer(hw3.answers.part3_q2)

Your answer:

The Multi-Scale Attention Fusion approach involves a multi-step process to capture both local and global context. Firstly, a set of different window sizes or scales is defined, each representing a specific contextual range. Attention patterns are then computed independently for each scale using sliding-window attention, extracting contextual information within the corresponding window size. This step ensures that both local dependencies and broader interactions are considered.

Next, scale-specific weights are assigned to determine the relative importance of each scale. These weights can be dynamically determined based on their relevance to the task or predefined using specific criteria. The purpose of these weights is to control the contribution of each scale during the attention fusion process, allowing for a balanced integration of information from different contextual ranges.

The attention fusion step combines the attention patterns from different scales using the assigned weights. This fusion process aggregates the attended information from various scales, resulting in a final attention representation that captures both local and global context. By fusing attention patterns from multiple scales, the model can incorporate information from different contextual ranges, enabling a more comprehensive understanding of the input sequence.

An additional refinement step can be performed (optionally) to recalibrate or adjust the attention distribution based on the fused attention pattern. This step fine-tunes the attention representation, ensuring that it aligns with the specific requirements of the task at hand. It allows for further enhancement of the integration of local and global context, leading to more effective information utilization.

This variation maintains a similar computational complexity to sliding-window attention since attention is computed independently at each scale, with the fusion step being an additional computational cost.

In [1]:
import os
import sys
import pathlib
import pickle
import re
import shutil
import unittest
import urllib

import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt


%load_ext autoreload
%autoreload 2
In [2]:
from torch.utils.data import DataLoader, Dataset

from datasets import DatasetDict
from datasets import load_dataset, load_metric, concatenate_datasets

from hw3 import training


from cs236781.plot import plot_fit
from cs236781.train_results import FitResult

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$

Part 4: Fine-Tuning a pretrained language model¶

In this part, we will fine-tune BERT for sentiment analysis on the IMDB movie reviews dataset from the previous section.
BERT is a large language model published by Google researchers in 2018 that offers a good balance between popularity and model size, and it can be fine-tuned on a single GPU.

If you aren't yet familiar, you can check it out here:
https://arxiv.org/pdf/1810.04805.pdf. (Read Section 3 for details on the model architecture and fine-tuning on downstream tasks).

In particular, we will use the distilled (smaller) version of BERT, called Distil-BERT. Distil-BERT is widely used in production since it has 40% fewer parameters than BERT, while running 60% faster and retaining 95% of the performance in many benchmarks. It is recommended to glance through the Distil-BERT paper to get a feel for the model architecture and how it differs from BERT: https://arxiv.org/pdf/1910.01108.pdf

We will download a pre-trained Distil-BERT from Hugging Face, so there is no need to train it from scratch.

One of the key strengths of Hugging Face is its extensive collection of pre-trained models. These models are trained on large-scale datasets and exhibit impressive performance on various NLP tasks, such as text classification, named entity recognition, sentiment analysis, machine translation, and question answering, among others. The pre-trained models provided by Hugging Face can be easily fine-tuned for specific downstream tasks, saving significant time and computational resources.

Loading the Dataset¶

We will now load and prepare the IMDB dataset as we did in the previous part.
Here we will load the full training set and a small slice of the test set.

In [3]:
dataset = load_dataset('imdb', split=['train', 'test[12260:12740]'])
Found cached dataset imdb (/home/hay.e/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)
  0%|          | 0/2 [00:00<?, ?it/s]
In [4]:
print(dataset)
[Dataset({
    features: ['text', 'label'],
    num_rows: 25000
}), Dataset({
    features: ['text', 'label'],
    num_rows: 480
})]
In [5]:
#wrap it in a DatasetDict to enable methods such as map and format
dataset = DatasetDict({'train': dataset[0], 'test': dataset[1]})
In [6]:
dataset
Out[6]:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 480
    })
})

We can now access the datasets in the Dict as we would a dictionary. Let's print a few training samples

In [7]:
for i in range(4):
    print(f'TRAINING SAMPLE {i}:') 
    print(dataset['train'][i]['text'])
    label = dataset['train'][i]['label']
    print(f'Label {i}: {label}')
    print('\n')
TRAINING SAMPLE 0:
I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn't have much of a plot.
Label 0: 0


TRAINING SAMPLE 1:
"I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn't matter what one's political views are because this film can hardly be taken seriously on any level. As for the claim that frontal male nudity is an automatic NC-17, that isn't true. I've seen R-rated films with male nudity. Granted, they only offer some fleeting views, but where are the R-rated films with gaping vulvas and flapping labia? Nowhere, because they don't exist. The same goes for those crappy cable shows: schlongs swinging in the breeze but not a clitoris in sight. And those pretentious indie movies like The Brown Bunny, in which we're treated to the site of Vincent Gallo's throbbing johnson, but not a trace of pink visible on Chloe Sevigny. Before crying (or implying) "double-standard" in matters of nudity, the mentally obtuse should take into account one unavoidably obvious anatomical difference between men and women: there are no genitals on display when actresses appears nude, and the same cannot be said for a man. In fact, you generally won't see female genitals in an American film in anything short of porn or explicit erotica. This alleged double-standard is less a double standard than an admittedly depressing ability to come to terms culturally with the insides of women's bodies.
Label 1: 0


TRAINING SAMPLE 2:
If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film).<br /><br />One might better spend one's time staring out a window at a tree growing.<br /><br />
Label 2: 0


TRAINING SAMPLE 3:
This film was probably inspired by Godard's Masculin, féminin and I urge you to see that film instead.<br /><br />The film has two strong elements and those are, (1) the realistic acting (2) the impressive, undeservedly good, photo. Apart from that, what strikes me most is the endless stream of silliness. Lena Nyman has to be most annoying actress in the world. She acts so stupid and with all the nudity in this film,...it's unattractive. Comparing to Godard's film, intellectuality has been replaced with stupidity. Without going too far on this subject, I would say that follows from the difference in ideals between the French and the Swedish society.<br /><br />A movie of its time, and place. 2/10.
Label 3: 0


We should also check the label distribution:

In [8]:
def label_cnt(split):
    ds = dataset[split]
    size = len(ds)
    cnt = 0
    for smp in ds:
        cnt += smp['label']
    print(f'negative samples in {split} dataset: {size - cnt}')
    print(f'positive samples in {split} dataset: {cnt}')
    
label_cnt('train')
label_cnt('test')
negative samples in train dataset: 12500
positive samples in train dataset: 12500
negative samples in test dataset: 240
positive samples in test dataset: 240
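For reference, the same count can be written more compactly with the standard library's `collections.Counter`. A small sketch over toy records shaped like the dataset's `{'label': ...}` items (the values below are illustrative, not the real data):

```python
from collections import Counter

def label_counts(samples):
    """Count how many samples carry each label value."""
    return Counter(s["label"] for s in samples)

# Toy records standing in for dataset['train'] (values are illustrative):
toy = [{"label": 0}, {"label": 1}, {"label": 0}, {"label": 1}]
counts = label_counts(toy)
print(f"negative: {counts[0]}, positive: {counts[1]}")  # negative: 2, positive: 2
```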

Import the tokenizer for the dataset¶

We will now tokenize the text the same way we did in the previous part.

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print("Tokenizer input max length:", tokenizer.model_max_length)
print("Tokenizer vocabulary size:", tokenizer.vocab_size)
Tokenizer input max length: 512
Tokenizer vocabulary size: 30522
In [10]:
def tokenize_text(batch):
    return tokenizer(batch["text"], truncation=True, padding=True)

def tokenize_dataset(dataset):
    dataset_tokenized = dataset.map(tokenize_text, batched=True, batch_size=None)
    return dataset_tokenized

dataset_tokenized = tokenize_dataset(dataset)
# we would like to work with pytorch so we can manually fine-tune
dataset_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])
# no need to parallelize tokenization in this assignment
os.environ["TOKENIZERS_PARALLELISM"] = "false"
Loading cached processed dataset at /home/hay.e/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-2c94583cc9a60a41.arrow
Loading cached processed dataset at /home/hay.e/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-12ec9482883ca550.arrow
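Conceptually, `truncation=True, padding=True` clips each sequence to the model's maximum length, pads the batch to its longest member, and builds an attention mask marking real tokens. A pure-Python sketch of that behavior (not the tokenizer's actual implementation; the token ids below are made up):

```python
def pad_batch(token_ids, pad_id=0, max_length=512):
    """Mimic tokenizer(..., truncation=True, padding=True): truncate each
    sequence to max_length, pad to the longest sequence in the batch, and
    build the matching attention mask (1 = real token, 0 = padding)."""
    truncated = [ids[:max_length] for ids in token_ids]
    width = max(len(ids) for ids in truncated)
    return {
        "input_ids": [ids + [pad_id] * (width - len(ids)) for ids in truncated],
        "attention_mask": [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated],
    }

batch = pad_batch([[101, 2023, 102], [101, 102]])
print(batch["input_ids"])       # [[101, 2023, 102], [101, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```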

Setting up the dataloaders and dataset¶

We will now set up the dataloaders for efficient batching and loading of the data.
By now, you are familiar with the class methods needed to create a working DataLoader.

In [11]:
class IMDBDataset(Dataset):
    def __init__(self, dataset):
        self.ds = dataset

    def __getitem__(self, index):
        return self.ds[index]

    def __len__(self):
        return self.ds.num_rows
In [12]:
train_dataset = IMDBDataset(dataset_tokenized['train'])
test_dataset = IMDBDataset(dataset_tokenized['test'])
In [13]:
n_workers = 0

dl_train = DataLoader(
    dataset=train_dataset,
    batch_size=12,
    shuffle=True,
    num_workers=n_workers,
)
dl_test = DataLoader(
    dataset=test_dataset,
    batch_size=12,
    shuffle=False,
    num_workers=n_workers,
)
In [14]:
dl_train
Out[14]:
<torch.utils.data.dataloader.DataLoader at 0x7ff4409fde50>
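A quick sanity check for any DataLoader is to pull a single batch and inspect its shapes. A toy sketch with random tensors standing in for the tokenized reviews (the vocabulary size matches the tokenizer above, but the sequence length and dataset size here are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-ins for input_ids and labels (sizes are illustrative):
input_ids = torch.randint(0, 30522, (24, 16))
labels = torch.randint(0, 2, (24,))
dl = DataLoader(TensorDataset(input_ids, labels), batch_size=12, shuffle=False)

# One batch should have batch_size rows and one label per row:
batch_ids, batch_labels = next(iter(dl))
print(batch_ids.shape, batch_labels.shape)  # torch.Size([12, 16]) torch.Size([12])
```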

Importing the model from Hugging Face¶

We will now delve into the process of loading the DistilBERT model from Hugging Face. DistilBERT is a distilled version of the BERT model, offering a lighter and faster alternative while retaining considerable performance on various NLP tasks.
Please refer to the introduction to check out the relevant papers.
For more info on how to use this model, feel free to check it out on the site:
https://huggingface.co/distilbert-base-uncased

To begin, we will import the class required for our implementation. It is fine if Hugging Face warns you that the model should be trained on a downstream task; that is exactly what we will do on our IMDB dataset.

In [15]:
from transformers import AutoModelForSequenceClassification
In [16]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Let's print the model architecture to see what we are dealing with:

In [17]:
model
Out[17]:
DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (1): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (2): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (3): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (4): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (5): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
      )
    )
  )
  (pre_classifier): Linear(in_features=768, out_features=768, bias=True)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)

Fine Tuning¶

We will now move on to the process of fine-tuning the model that we previously loaded from Hugging Face. Fine-tuning allows us to adapt the pre-trained model to our specific NLP task by further training it on task-specific data. This process enhances the model's performance and enables it to make more accurate predictions on our target task.

There are generally two approaches to fine-tuning the loaded model, each with its own advantages and considerations:

  1. Freeze all the weights besides the last two linear layers and train only those layers:
    This approach is commonly referred to as "transfer learning" or "feature extraction." By freezing the weights of the majority of the model's layers, we retain the pre-trained knowledge captured by the model, allowing it to extract useful features from our data. We then replace and train the final few layers, typically linear layers, to adapt the model to our specific task. This method is beneficial when we have limited labeled data or when the pre-trained model has been trained on a similar domain.

  2. Retrain all the parameters in the model:
    This approach involves unfreezing and training all the parameters of the loaded model, including the pre-trained layers. By retraining all the parameters, we allow the model to adjust its representations and update its knowledge based on our specific task and data. This method is often preferred when we have sufficient labeled data available and want the model to learn task-specific features from scratch or when the pre-trained model's knowledge may not be directly applicable to our domain.

Fine-tuning method 1¶

Freeze all the weights besides the last two linear layers and train only those layers

In [18]:
# TODO:
# Freeze all parameters except for the last 2 linear layers
# ====== YOUR CODE: ======
for name, param in model.named_parameters():
    if "classifier" not in name:
        param.requires_grad_(False)
# ========================

# HINT: use the printed model architecture to get the layer names
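One way to verify the freeze took effect is to count trainable parameters. A sketch on a toy module whose submodule names mirror the `"classifier" not in name` rule used above (the toy layer sizes are made up; the same two `sum(...)` lines can be run on the real `model`):

```python
import torch.nn as nn

# Toy stand-in for the DistilBERT classifier (names illustrative, sizes made up):
toy = nn.Sequential()
toy.add_module("backbone", nn.Linear(8, 8))
toy.add_module("pre_classifier", nn.Linear(8, 8))
toy.add_module("classifier", nn.Linear(8, 2))

# Freeze everything outside the two classifier heads, as in the cell above
# (note "classifier" is a substring of "pre_classifier", so both heads stay trainable):
for name, param in toy.named_parameters():
    param.requires_grad_("classifier" in name)

trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
total = sum(p.numel() for p in toy.parameters())
print(f"trainable: {trainable}/{total}")  # trainable: 90/162
```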

Training¶

We can use our abstract Trainer class to fine-tune the model. We will not tune hyperparameters in this section, since the point is to learn how to fine-tune a model.
In addition, we do not need to pass our own loss function for this loaded model (try to understand why).
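The reason no external loss is needed: Hugging Face sequence-classification models compute the cross-entropy internally whenever `labels` is passed to `forward`, returning it alongside the logits. A toy stand-in illustrating that interface (this is not the actual DistilBERT code, just a sketch of the convention):

```python
import torch
import torch.nn.functional as F

class TinyClassifier(torch.nn.Module):
    """Illustrative stand-in for a Hugging Face classifier head: when
    `labels` is supplied, the forward pass computes the cross-entropy
    loss itself, so the trainer needs no separate loss_fn."""
    def __init__(self, dim=4, num_labels=2):
        super().__init__()
        self.head = torch.nn.Linear(dim, num_labels)

    def forward(self, features, labels=None):
        logits = self.head(features)
        loss = None if labels is None else F.cross_entropy(logits, labels)
        return {"logits": logits, "loss": loss}

torch.manual_seed(0)
out = TinyClassifier()(torch.randn(3, 4), labels=torch.tensor([0, 1, 1]))
print(out["loss"] is not None)  # True: the loss came from inside the model
```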

TODO: Implement the FineTuningTrainer in hw3/training.py

We will train the model for 2 epochs of 40 batches.
You can run this either locally or on the course servers, whichever is most comfortable for you.

In [19]:
from hw3 import training

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

# fit your model
if not os.path.exists('finetuned_last_2.pt'):
    trainer = training.FineTuningTrainer(model, loss_fn=None, optimizer=optimizer, device=device)
    fit_result = trainer.fit(dl_train, dl_test, checkpoints='finetuned_last_2', num_epochs=2, max_batches=40)
    with open('fit_result_finetune_2.pkl', 'wb') as f:
        pickle.dump(fit_result, f)
    

saved_state = torch.load('finetuned_last_2.pt', map_location=device)
model.load_state_dict(saved_state['model_state']) 
best_acc = saved_state['best_acc']
print('best acc:', best_acc)

with open('fit_result_finetune_2.pkl', 'rb') as f:
    fit_result = pickle.load(f) 
best acc: 77.08333333333333
In [20]:
plot_fit(fit_result)
Out[20]:
(<Figure size 1600x1000 with 4 Axes>,
 array([<Axes: title={'center': 'train_loss'}, xlabel='Iteration #', ylabel='Loss'>,
        <Axes: title={'center': 'train_acc'}, xlabel='Epoch #', ylabel='Accuracy (%)'>,
        <Axes: title={'center': 'test_loss'}, xlabel='Iteration #', ylabel='Loss'>,
        <Axes: title={'center': 'test_acc'}, xlabel='Epoch #', ylabel='Accuracy (%)'>],
       dtype=object))

Fine-tuning method 2¶

Retraining all the parameters in the model

We will reload the model to ensure that its parameters are untouched and we are starting again from the original pre-trained checkpoint.

In [21]:
from transformers import AutoModelForSequenceClassification
In [22]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
In [23]:
model
Out[23]:
DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (1): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (2): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (3): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (4): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (5): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
      )
    )
  )
  (pre_classifier): Linear(in_features=768, out_features=768, bias=True)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
In [24]:
# TODO: Make sure all the model parameters are unfrozen
# ====== YOUR CODE: ======
for param in model.parameters():
    param.requires_grad_(True)
# ========================
In [25]:
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

# fit your model
if not os.path.exists('finetuned_all.pt'):
    trainer = training.FineTuningTrainer(model, loss_fn=None, optimizer=optimizer, device=device)
    fit_result = trainer.fit(dl_train, dl_test, checkpoints='finetuned_all', num_epochs=2, max_batches=40)
    with open('finetuned_all.pkl', 'wb') as f:
        pickle.dump(fit_result, f)
    

saved_state = torch.load('finetuned_all.pt', map_location=device)
model.load_state_dict(saved_state['model_state']) 

with open('finetuned_all.pkl', 'rb') as f:
    fit_result = pickle.load(f)  
In [26]:
plot_fit(fit_result)
Out[26]:
(<Figure size 1600x1000 with 4 Axes>,
 array([<Axes: title={'center': 'train_loss'}, xlabel='Iteration #', ylabel='Loss'>,
        <Axes: title={'center': 'train_acc'}, xlabel='Epoch #', ylabel='Accuracy (%)'>,
        <Axes: title={'center': 'test_loss'}, xlabel='Iteration #', ylabel='Loss'>,
        <Axes: title={'center': 'test_acc'}, xlabel='Epoch #', ylabel='Accuracy (%)'>],
       dtype=object))

Questions¶

Fill out your answers in hw3.answers.part4_q1 and hw3.answers.part4_q2

In [27]:
from cs236781.answers import display_answer
import hw3.answers

Question 1¶

Explain the results you obtained here in comparison to those achieved by the encoder trained from scratch in the previous part.
If one of the models performed better, why was this so?
Will this always be the case on any downstream task, or is this phenomenon specific to this task?

In [28]:
display_answer(hw3.answers.part4_q1)

Your answer:

We fine-tuned a pre-trained DistilBERT model for sentiment analysis using two methods: Method 1 froze most layers and trained only the final linear layers, while Method 2 retrained all parameters. Test accuracies exceeded 80% for both methods, with Method 2 reaching higher accuracy than Method 1. Compared to the encoder trained from scratch in Part 3, which achieved a test accuracy of 67.9%, both fine-tuned DistilBERT models clearly outperformed it.

There are several factors that can contribute to the improved performance of the fine-tuned BERT models. Firstly, BERT is a powerful language model pre-trained on a large corpus of data, enabling it to capture rich contextual representations. By fine-tuning BERT on the specific sentiment analysis task, we leverage its pre-trained knowledge and allow it to adapt to the nuances of sentiment classification. Additionally, fine-tuning methods involve updating the weights of the pre-trained model based on task-specific data. In method 1, we froze the majority of the model's layers and only trained the last few linear layers. This approach is suitable when limited labeled data is available or when the pre-trained model is already knowledgeable in a similar domain. Method 2 involved retraining all the parameters of the model, allowing it to learn task-specific features from scratch or adapt its representations more flexibly.

The superior performance of the fine-tuned BERT models in this specific task does not necessarily guarantee the same outcome in any downstream task. The effectiveness of pre-training and fine-tuning depends on various factors such as the size and diversity of the pre-training data, the similarity of the pre-training task to the downstream task, and the availability of labeled data for fine-tuning. Different tasks may require different strategies and approaches, and the performance of pre-trained models may vary across tasks.

Question 2¶

Assume that when fine-tuning, instead of freezing everything but the last two linear layers, you instead froze some other internal model layers, such as the multi-headed attention blocks.
Would the model still be able to successfully fine-tune to this task, or would the results be worse?
Explain.

In [29]:
display_answer(hw3.answers.part4_q2)

Your answer:

When fine-tuning a pre-trained model, it is usually better to freeze the internal layers closer to the input and update only the last layers, such as the classification head. The earlier layers capture general language features, while the layers closer to the output capture patterns specific to the task.

If we instead froze internal layers such as the multi-headed attention blocks, the model would lose the ability to adapt its contextual representations to the task, which can hurt performance. The size of this effect varies: when the pre-training and fine-tuning tasks are similar and enough data is available, the frozen attention blocks may already produce useful features, so the damage could be small. When the tasks differ or labeled data is limited, the mismatch is harder to compensate for; in that case, retraining the many unfrozen parameters can also lead to overfitting or difficulties in fine-tuning due to the increased complexity.

It's generally easier and more effective to freeze the internal layers and update only the last layers during fine-tuning. This allows the model to leverage its pre-trained knowledge while adapting to the specific task.